Marks: 60
The number of restaurants in New York is increasing day by day. Many students and busy professionals rely on these restaurants because of their hectic lifestyles, and online food delivery services give them convenient access to good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand for different restaurants, which will help them enhance their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company improve its business.
The dataset contains information related to each food order. The detailed data dictionary is given below.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# set decimal places to 2
pd.set_option('display.float_format', lambda x: '%.2f' % x)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# read the data
df = pd.read_csv('/content/drive/My Drive/Data Science/GL Projects/foodhub_order.csv')
# returns the first 5 rows
df.head()
|   | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.
df.shape
(1898, 9)
There are 1898 rows and 9 columns.
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   order_id               1898 non-null   int64
 1   customer_id            1898 non-null   int64
 2   restaurant_name        1898 non-null   object
 3   cuisine_type           1898 non-null   object
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object
 6   rating                 1898 non-null   object
 7   food_preparation_time  1898 non-null   int64
 8   delivery_time          1898 non-null   int64
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB
Columns of int64 (integer) type include: order_id, customer_id, food_preparation_time and delivery_time.
Columns of object (string) type include: restaurant_name, cuisine_type, day_of_the_week and rating.
cost_of_the_order column is type float64.
df.isnull().sum()
order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64
There are no missing values.
df.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| order_id | 1898.00 | 1477495.50 | 548.05 | 1476547.00 | 1477021.25 | 1477495.50 | 1477969.75 | 1478444.00 |
| customer_id | 1898.00 | 171168.48 | 113698.14 | 1311.00 | 77787.75 | 128600.00 | 270525.00 | 405334.00 |
| cost_of_the_order | 1898.00 | 16.50 | 7.48 | 4.47 | 12.08 | 14.14 | 22.30 | 35.41 |
| food_preparation_time | 1898.00 | 27.37 | 4.63 | 20.00 | 23.00 | 27.00 | 31.00 | 35.00 |
| delivery_time | 1898.00 | 24.16 | 4.97 | 15.00 | 20.00 | 25.00 | 28.00 | 33.00 |
Minimum food preparation time = 20 minutes
Average food preparation time = 27 minutes
Maximum food preparation time = 35 minutes
df[df['rating']=='Not given']['rating'].count()
736
736 orders are not rated.
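An alternative way to handle the unrated orders (not used in this notebook, shown only as a sketch on a toy frame) is to convert the `'Not given'` placeholder to `NaN`, so the rating column becomes numeric and missing values are counted by the usual pandas tools:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustration only)
toy = pd.DataFrame({'rating': ['Not given', '5', '3', 'Not given', '4']})

# Replace the placeholder with NaN, then cast to a numeric dtype
toy['rating'] = toy['rating'].replace('Not given', np.nan).astype(float)

print(toy['rating'].isna().sum())  # number of unrated orders
print(toy['rating'].mean())        # mean computed over rated orders only
```

With this encoding, `df.isnull().sum()` would report the 736 unrated orders directly, and aggregations like `.mean()` automatically skip them.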
sns.histplot(data=df, x='food_preparation_time', bins=8, kde=True)
plt.title("Histogram - Food Preparation Time")
plt.xlabel("Food Preparation Time (minutes)");
plt.title("Box Plot - Food Preparation Time")
sns.boxplot(data=df, x='food_preparation_time')
plt.xlabel("Food Preparation Time (minutes)");
Observations: Food preparation time follows a relatively uniform distribution pattern - frequencies are similar across the bins. The median food preparation time is 27 minutes. Food preparation times range from 20 to 35 minutes. There are no outliers.
sns.histplot(data=df, x='delivery_time', kde=True, bins=9)
plt.title("Histogram - Delivery Time")
plt.xlabel("Delivery Time (minutes)");
plt.title("Box Plot - Delivery Time")
sns.boxplot(data=df, x='delivery_time')
plt.xlabel("Delivery Time (minutes)");
Observations: Delivery time follows an asymmetric distribution pattern which is somewhat skewed left. The median delivery time is 25 minutes. Delivery times range from 15 to 33 minutes. There are no outliers.
sns.histplot(data=df, x='cost_of_the_order', kde=True, bins=12)
plt.title("Histogram - Order Cost")
plt.xlabel("Order Cost (dollars)");
plt.title("Box Plot - Order Cost")
sns.boxplot(data=df, x='cost_of_the_order')
plt.xlabel("Order Cost (dollars)");
Observations: Order cost follows an asymmetric distribution pattern which is skewed right. The median cost is \$14.14. Order costs range from \$4.47 to \$35.41. There are no outliers.
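The right skew visible in the histogram can also be quantified numerically with `Series.skew()`; a minimal sketch on toy values (the real analysis would call this on `df['cost_of_the_order']`):

```python
import pandas as pd

# Toy right-skewed costs (illustration only)
costs = pd.Series([5, 6, 7, 8, 9, 10, 30, 35])

# Positive sample skewness indicates a right (positive) skew
print(costs.skew() > 0)  # True for this right-skewed sample
```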
plt.figure(figsize=(20,5))
plt.title("Count Plot - Cuisine Type")
sns.countplot(data=df, x='cuisine_type', order = df['cuisine_type'].value_counts().index)
plt.xlabel("Cuisine Type")
plt.ylabel("Count");
Observations: The top 3 most commonly ordered cuisine types are American, Japanese and Italian. The least common are Vietnamese, Spanish and Korean.
plt.title("Count Plot - Day of the Week")
sns.countplot(data=df, x='day_of_the_week')
plt.xlabel("Day of the Week")
plt.ylabel("Count");
Observations: More orders are placed on the weekend compared to weekdays.
plt.title("Count Plot - Ratings")
sns.countplot(data=df, x='rating', order=['Not given','3','4','5'])
plt.xlabel("Rating")
plt.ylabel("Count");
Observations: The most common rating is 5. It is also common for no rating to be given.
df['restaurant_name'].value_counts().head()
Shake Shack                  219
The Meatball Shop            132
Blue Ribbon Sushi            119
Blue Ribbon Fried Chicken     96
Parm                          68
Name: restaurant_name, dtype: int64
The top 5 restaurants by number of orders received are Shake Shack (219), The Meatball Shop (132), Blue Ribbon Sushi (119), Blue Ribbon Fried Chicken (96), and Parm (68).
# Checking value_counts of cuisine_type for rows where day_of_the_week is 'Weekend'.
df[df['day_of_the_week']=='Weekend']['cuisine_type'].value_counts().head(1)
American    415
Name: cuisine_type, dtype: int64
The most popular cuisine on weekends is American.
# Index based on condition to select rows (orders) where cost is greater than 20 dollars
# Count the number of rows (orders) where cost is greater than 20 dollars with .shape attribute
# Divide by the total number of rows (orders) in the original DataFrame and multiply by 100 to get percent
df[df['cost_of_the_order']>20].shape[0]/df.shape[0]*100
29.24130663856691
29.2% of the orders cost more than 20 dollars.
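The same percentage can be computed more compactly, because the mean of a boolean mask equals the fraction of `True` values; a sketch on toy costs (the real analysis uses `df['cost_of_the_order']`):

```python
import pandas as pd

# Toy order costs (illustration only)
costs = pd.Series([25.0, 10.0, 30.0, 15.0])

# Mean of the boolean mask == fraction of orders above $20
pct = (costs > 20).mean() * 100
print(pct)  # 50.0
```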
# Use .mean() method to find the mean of a column/Series
df['delivery_time'].mean()
24.161749209694417
The mean order delivery time is 24 minutes.
# Use value_counts method to check the number of orders placed by each unique customer ID in order from greatest to least.
df['customer_id'].value_counts().head(3)
52832    13
47440    10
83287     9
Name: customer_id, dtype: int64
Top customer was ID #52832 who placed 13 orders.
2nd most frequent customer was ID #47440 who placed 10 orders.
3rd most frequent customer was ID #83287 who placed 9 orders.
# Check pairplot to look for relationships between numerical variables.
sns.pairplot(data=df[['food_preparation_time','delivery_time','cost_of_the_order']], corner=True);
# Calculate and visualize correlation coefficient for numerical variables using heatmap and .corr method.
sns.heatmap(data=df[['food_preparation_time','delivery_time','cost_of_the_order']].corr(), annot=True);
Observations: There is no significant correlation among delivery time, food preparation time, and cost of the order.
# Use boxplot to compare cost distribution within different cuisine type categories.
plt.figure(figsize=(20,5))
sns.set(style='darkgrid')
plt.title("Box Plot - Cuisine Type vs. Cost")
sns.boxplot(data=df, x='cuisine_type',y='cost_of_the_order')
plt.xlabel("Cuisine Type")
plt.ylabel("Order Cost (dollars)");
Observations: French cuisine had the highest median cost. Vietnamese and Korean cuisine had the lowest median cost.
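The medians behind the box plot can be tabulated directly with a `groupby`; a minimal sketch on a toy frame (the real analysis would group `df` the same way):

```python
import pandas as pd

# Toy orders (illustration only)
toy = pd.DataFrame({
    'cuisine_type': ['French', 'French', 'Korean', 'Korean'],
    'cost_of_the_order': [30.0, 28.0, 10.0, 12.0],
})

# Median cost per cuisine, highest first
medians = toy.groupby('cuisine_type')['cost_of_the_order'].median().sort_values(ascending=False)
print(medians)
```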
# Use boxplot to compare preparation time distribution within different cuisine type categories.
plt.figure(figsize=(20,5))
sns.set(style='darkgrid')
plt.title("Box Plot - Cuisine Type vs. Preparation Time")
sns.boxplot(data=df, x='cuisine_type',y='food_preparation_time')
plt.xlabel("Cuisine Type")
plt.ylabel("Food Preparation Time (minutes)");
Observations: Thai and Italian cuisine had the highest median food preparation times. Vietnamese and Korean cuisine had the lowest median food preparation times (which could explain why they also had the lowest median costs).
sns.set(style='white')
plt.title("Box Plot - Day of the Week vs. Delivery Time")
sns.boxplot(data=df, x='day_of_the_week', y='delivery_time')
plt.xlabel("Day of the Week")
plt.ylabel("Delivery Time (minutes)");
Observations: Median delivery time was higher on weekdays compared to weekends.
plt.title("Point Plot - Rating vs. Delivery Time")
sns.pointplot(data=df, x='rating', y='delivery_time', order=['Not given','3','4','5'])
plt.xlabel("Rating")
plt.ylabel("Delivery Time (minutes)");
Observations: Orders with lowest rating (3) had a higher delivery time on average.
plt.title("Point Plot - Rating vs. Food Preparation Time")
sns.pointplot(data=df, x='rating', y='food_preparation_time', order=['Not given','3','4','5'])
plt.xlabel("Rating")
plt.ylabel("Preparation Time (minutes)");
Observation: Average food preparation time was similar across different ratings.
plt.title("Point Plot - Rating vs. Order Cost")
sns.pointplot(data=df, x='rating', y='cost_of_the_order', order=['Not given','3','4','5'])
plt.xlabel("Rating")
plt.ylabel("Order Cost (dollars)");
Observations: Orders with a higher rating tended to have a higher cost on average.
# Create a new DataFrame excluding rows where 'rating' was 'Not given'.
df_ratings = df[df['rating']!='Not given'].copy()
# Change data type of 'rating' column to numeric type in order to perform arithmetic (mean) calculation.
df_ratings['rating'] = df_ratings['rating'].astype(float)
# Use groupby method to create a new DataFrame with restaurants' rating mean and rating count.
df_ratings_agg = df_ratings.groupby('restaurant_name')['rating'].agg(['mean','count']).reset_index()
df_ratings_agg
|   | restaurant_name | mean | count |
|---|---|---|---|
| 0 | 'wichcraft | 5.00 | 1 |
| 1 | 12 Chairs | 4.50 | 2 |
| 2 | 5 Napkin Burger | 4.00 | 2 |
| 3 | 67 Burger | 5.00 | 1 |
| 4 | Amma | 4.50 | 2 |
| ... | ... | ... | ... |
| 151 | Zero Otto Nove | 4.00 | 1 |
| 152 | brgr | 3.00 | 1 |
| 153 | da Umberto | 5.00 | 1 |
| 154 | ilili Restaurant | 4.15 | 13 |
| 155 | indikitch | 4.50 | 2 |
156 rows × 3 columns
# Use logical indexing to display a list of the restaurants with mean rating above 4 and rating count more than 50.
df_ratings_agg[(df_ratings_agg['mean']>4)&(df_ratings_agg['count']>50)]['restaurant_name'].tolist()
['Blue Ribbon Fried Chicken', 'Blue Ribbon Sushi', 'Shake Shack', 'The Meatball Shop']
The following restaurants have a rating count of more than 50 and an average rating greater than 4: Blue Ribbon Fried Chicken, Blue Ribbon Sushi, Shake Shack, and The Meatball Shop.
# Find the sum of 'cost_of_the_order' where cost is greater than 20 and multiply by 0.25 to get 25%.
# Find the sum of 'cost_of_the_order' where cost is greater than 5 and less than or equal to 20. Multiply by 0.15 to get 15%.
# Add the two together to get net revenue across all orders.
df[df['cost_of_the_order']>20]['cost_of_the_order'].sum()*.25 + df[(df['cost_of_the_order']>5)&(df['cost_of_the_order']<=20)]['cost_of_the_order'].sum()*.15
6166.303
The company's net revenue across all orders is approximately \$6,166.
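The tiered commission (25% on orders over \$20, 15% on orders over \$5 up to \$20) can also be computed per order with `np.select`, which picks the first matching condition for each element; a sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy order costs (illustration only)
cost = pd.Series([25.0, 10.0, 4.0])

# Conditions are checked in order: 25% above $20, 15% for (5, 20], 0 otherwise
commission = np.select(
    [cost > 20, cost > 5],
    [cost * 0.25, cost * 0.15],
    default=0.0,
)
print(commission.sum())  # 25*0.25 + 10*0.15 + 0 = 7.75
```

A per-order commission column like this also makes it easy to attribute revenue to individual restaurants or cuisines later.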
# Create a new column called 'total_time' which is the sum of food_preparation_time plus delivery_time.
# Use conditional indexing with .shape method to count the number of rows where total_time is greater than 60. Divide by original number of rows & multiply by 100 to get percent.
df['total_time'] = df['food_preparation_time'] + df['delivery_time']
df[df['total_time']>60].shape[0]/df.shape[0]*100
10.537407797681771
10.5% of orders take more than 60 minutes to get delivered from the time the order is placed.
# Find the difference between the mean delivery time on weekdays vs. weekends using logical indexing.
df[df['day_of_the_week']=='Weekday']['delivery_time'].mean()-df[df['day_of_the_week']=='Weekend']['delivery_time'].mean()
5.870014357297798
The mean delivery time is about 6 minutes higher on weekdays compared to weekends.
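The same comparison can be expressed with a single `groupby` instead of two filtered means; a minimal sketch on toy values (the real analysis uses `df`):

```python
import pandas as pd

# Toy delivery times (illustration only)
toy = pd.DataFrame({
    'day_of_the_week': ['Weekday', 'Weekday', 'Weekend', 'Weekend'],
    'delivery_time': [28, 30, 22, 24],
})

# Mean delivery time per day type, then the weekday-weekend gap
means = toy.groupby('day_of_the_week')['delivery_time'].mean()
print(means['Weekday'] - means['Weekend'])  # 29 - 23 = 6.0
```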
Problem Statement:
E-news Express is an online news portal whose goal is to increase subscribers and drive engagement. The design team of the company has researched and created a new landing page with a new outline & more relevant content shown compared to the old page. To test the effectiveness of the new landing page, the data science team randomly selected 100 users and divided them equally into two groups, where one group was shown the old page and one group was shown the new landing page. Data was collected to analyze how users interacted with each page.
Objectives:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import scipy.stats as stats
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
data = pd.read_csv('/content/drive/My Drive/Data Science/GL Projects/abtest.csv')
# Display the first 5 and last 5 rows of the data
data
|   | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 0 | 546592 | control | old | 3.48 | no | Spanish |
| 1 | 546468 | treatment | new | 7.13 | yes | English |
| 2 | 546462 | treatment | new | 4.40 | no | Spanish |
| 3 | 546567 | control | old | 3.02 | no | French |
| 4 | 546459 | treatment | new | 4.75 | yes | Spanish |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 546446 | treatment | new | 5.15 | no | Spanish |
| 96 | 546544 | control | old | 6.52 | yes | English |
| 97 | 546472 | treatment | new | 7.07 | yes | Spanish |
| 98 | 546481 | treatment | new | 6.20 | yes | Spanish |
| 99 | 546483 | treatment | new | 5.86 | yes | English |
100 rows × 6 columns
# Use .shape attribute to return number of rows & columns in the data
print(f'There are {data.shape[0]} rows & {data.shape[1]} columns.')
There are 100 rows & 6 columns.
# Use .describe() method to display a statistical summary of the numerical variables (drop user_id, which is an identifier rather than a meaningful numerical variable)
data.describe().T.drop('user_id')
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| time_spent_on_the_page | 100.0 | 5.3778 | 2.378166 | 0.19 | 3.88 | 5.415 | 7.0225 | 10.71 |
Time spent on the page ranges from 0.19 minutes (11.4 seconds) to 10.71 minutes. Average time spent on the page is 5.38 minutes with a standard deviation of 2.38 minutes. Median time spent on the page is 5.42 minutes.
# Use include='object' parameter to display summary of categorical variables
data.describe(include='object').T
|   | count | unique | top | freq |
|---|---|---|---|---|
| group | 100 | 2 | control | 50 |
| landing_page | 100 | 2 | old | 50 |
| converted | 100 | 2 | yes | 54 |
| language_preferred | 100 | 3 | Spanish | 34 |
There are 2 groups of equal frequency (one for each landing page). There are 3 different preferred languages, the most common being Spanish.
# Use .info() method to check non-null counts and column datatypes
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   user_id                 100 non-null    int64
 1   group                   100 non-null    object
 2   landing_page            100 non-null    object
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object
 5   language_preferred      100 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB
There are 2 columns with numerical datatypes: user_id and time_spent_on_the_page
The rest of the columns are string/object types: group, landing_page, converted, & language_preferred
# Use .isnull() and .sum() methods to verify there are no missing (null) values
data.isnull().sum()
user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64
There are no missing values in the data.
# Use .duplicated() and .sum() methods to check for duplicated data
data.duplicated().sum()
0
There are no duplicates in the data.
# Use histogram to visualize numerical variable
sns.set_palette("Set1")
sns.histplot(data, x='time_spent_on_the_page', kde=True)
plt.title('Histogram: Time Spent on Page')
plt.xlabel('Time Spent on Page');
Time spent on the page appears to follow a normal distribution pattern.
# Use boxplot to visualize numerical variable
sns.boxplot(data=data, x='time_spent_on_the_page')
plt.title('Box Plot: Time Spent on Page')
plt.xlabel('Time Spent on Page');
The box plot confirms that time spent on the page is not skewed and there are no outliers.
# Use countplot to visualize categorical variables
sns.countplot(data=data, x='group')
plt.title('Countplot: Group')
plt.xlabel('Group')
plt.ylabel('Count');
The number of users in the control group is equal to the number of users in the treatment group (50 in each group).
sns.countplot(data=data, x='landing_page')
plt.title('Countplot: Landing Page')
plt.xlabel('Landing Page')
plt.ylabel('Count');
The number of users directed to the old landing page (control group) is equal to the number of users directed to the new landing page (treatment group).
sns.countplot(data=data, x='converted')
plt.title('Countplot: Converted')
plt.xlabel('Converted')
plt.ylabel('Count');
More users were converted compared to not-converted.
sns.countplot(data=data.sort_values('language_preferred'), x='language_preferred')
plt.title('Countplot: Language Preferred')
plt.xlabel('Language Preferred')
plt.ylabel('Count');
The numbers of users preferring Spanish and French are equal, and both are slightly larger than the number of users preferring English.
sns.boxplot(data=data, x='landing_page', y='time_spent_on_the_page')
plt.title('Box Plot: Time Spent on Page for Old vs. New')
plt.xlabel('Landing Page')
plt.ylabel('Time Spent on Page');
It appears that users spend more time on the new landing page compared to the old landing page. There are some high and low outliers for time spent on the new landing page. There is less variation in time spent on the new page compared to the old page.
sns.boxplot(data=data, x='converted', y='time_spent_on_the_page')
plt.title('Box Plot: Time Spent on Page for Conversion')
plt.xlabel('Converted')
plt.ylabel('Time Spent on Page');
It appears that users who are converted spend more time on the landing page with less variation compared to users who are not converted. There are some outliers in both categories.
sns.boxplot(data=data, x='language_preferred', y='time_spent_on_the_page')
plt.title('Box Plot: Time Spent on Page with Preferred Language')
plt.xlabel('Preferred Language')
plt.ylabel('Time Spent on Page');
The time spent on the page appears to be similar across the preferred languages. The variation is smaller for users who prefer Spanish, with one low outlier.
sns.countplot(data=data, x='landing_page', hue='converted')
plt.title('Count Plot: Conversion based on Landing Page')
plt.xlabel('Landing Page')
plt.ylabel('Count');
It appears that users who visited the new landing page were more likely to be converted than not, while users who visited the old landing page were less likely to be converted.
sns.countplot(data=data, x='language_preferred', hue='converted')
plt.title('Count Plot: Conversion based on Language')
plt.xlabel('Preferred Language')
plt.ylabel('Count');
It appears that users who prefer English were substantially more likely to be converted than not, while users preferring French were less likely to be converted. Conversion counts were similar across users who prefer Spanish.
sns.boxplot(data=data, x='landing_page', y='time_spent_on_the_page')
plt.title('Box Plot: Time Spent on Old vs. New Page')
plt.xlabel('Landing Page')
plt.ylabel('Time Spent on Page');
It appears that users spend more time on the new landing page compared to the old page.
Null hypothesis: the mean time spent on the new landing page is equal to the mean time spent on the old landing page
Alternate hypothesis: the mean time spent on the new landing page is greater than the mean time spent on the old landing page
Since we are comparing means from two independent samples with unknown population standard deviations, a two-sample t-test is appropriate. The data are continuous, approximately normally distributed, and randomly sampled, so the assumptions for the t-test are met.
A significance level of 0.05 caps the probability of a Type I error (false positive) at 5%.
# Create new DataFrames to separate data for old landing page vs. new landing page. Calculate mean and standard deviation for each.
data_old = data[data['landing_page']=='old'].reset_index(drop=True)
data_new = data[data['landing_page']=='new'].reset_index(drop=True)
print(f'The average time spent on the old landing page is {round(data_old["time_spent_on_the_page"].mean(),2)} minutes with standard deviation of {round(data_old["time_spent_on_the_page"].std(),2)} minutes')
print(f'The average time spent on the new landing page is {round(data_new["time_spent_on_the_page"].mean(),2)} minutes with standard deviation of {round(data_new["time_spent_on_the_page"].std(),2)} minutes')
The average time spent on the old landing page is 4.53 minutes with standard deviation of 2.58 minutes
The average time spent on the new landing page is 6.22 minutes with standard deviation of 1.82 minutes
The sample standard deviations are noticeably different, so equal population variances should not be assumed.
# Use the independent t-test with equal_var=False (Welch's t-test), which does not assume equal population variances.
from scipy.stats import ttest_ind
test_stat, p_value = ttest_ind(data_new['time_spent_on_the_page'],data_old['time_spent_on_the_page'], equal_var=False, alternative='greater')
print(f'The p-value is {p_value}')
The p-value is 0.0001392381225166549
The p-value is less than the level of significance (alpha = 0.05). Therefore, there is enough evidence to reject the null hypothesis.
Rejecting the null hypothesis means there is sufficient evidence, at the 5% significance level, to conclude that users spend more time on the new landing page than on the old one.
# Create a new DataFrame to count conversions
conversions = pd.DataFrame({
'Page':['New page','Old page'],
'Converted':
[data_new['converted'][data_new['converted']=='yes'].count(),
data_old['converted'][data_old['converted']=='yes'].count()],
'Not converted':
[data_new['converted'][data_new['converted']=='no'].count(),
data_old['converted'][data_old['converted']=='no'].count()],
'Sample size':[data_new.shape[0],data_old.shape[0]]})
conversions
|   | Page | Converted | Not converted | Sample size |
|---|---|---|---|---|
| 0 | New page | 33 | 17 | 50 |
| 1 | Old page | 21 | 29 | 50 |
# Add a new column to calculate conversion rate (number of users converted out of each sample)
conversions['Conversion Rate'] = conversions['Converted']/conversions['Sample size']
conversions
|   | Page | Converted | Not converted | Sample size | Conversion Rate |
|---|---|---|---|---|---|
| 0 | New page | 33 | 17 | 50 | 0.66 |
| 1 | Old page | 21 | 29 | 50 | 0.42 |
# Use a bar plot to visualize difference between conversion rate for old page vs. new page
sns.catplot(data=conversions, x='Page', y='Conversion Rate', kind='bar')
plt.title('Bar Plot: Conversion Rate for New Page vs. Old Page');
It appears that the conversion rate is higher for the new page compared to the old page.
Null hypothesis: the conversion rates are equal for the old page and the new page
Alternative hypothesis: the conversion rate for the new page is higher than the old page
Since we are comparing proportions from 2 independent samples, the 2-sample z-test is the appropriate test. Its assumptions are met: the outcome is binary (converted or not), the two samples are independent and randomly drawn, and both groups are large enough for the normal approximation (the numbers of conversions and non-conversions are both well above 10 in each group).
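The large-sample condition can be checked explicitly from the counts in the conversion table above; a minimal sketch (the threshold of 10 is a common rule of thumb, not a universal constant):

```python
# (sample size, conversions) for each group, from the conversion table above
groups = {'new': (50, 33), 'old': (50, 21)}

# For the normal approximation, successes and failures should both be >= 10
checks = {}
for name, (n, conv) in groups.items():
    checks[name] = (conv >= 10) and (n - conv >= 10)

print(checks)  # {'new': True, 'old': True}
```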
A significance level of 0.05 caps the probability of a Type I error (false positive) at 5%.
# Use NumPy arrays of conversion counts and number of observations in each category to execute the proportions_ztest function
conversion_count = np.array(conversions['Converted'])
nobs = np.array(conversions['Sample size'])
from statsmodels.stats.proportion import proportions_ztest
# alternative='larger' matches the one-sided alternative hypothesis (new page conversion rate > old page conversion rate)
test_stat, p_value = proportions_ztest(conversion_count, nobs, alternative='larger')
print(f'The p-value is {p_value}')
The p-value is 0.008026308204056278
The p-value is less than the significance level of 0.05, so there is enough evidence to reject the null hypothesis and conclude that the conversion rate for the new page is higher than the conversion rate for the old page.
# Create new DataFrames to separate data by preferred language
data_eng = data[data['language_preferred']=='English']
data_fren = data[data['language_preferred']=='French']
data_span = data[data['language_preferred']=='Spanish']
# Create a contingency table for conversion and preferred language
lang_table = pd.DataFrame({
'Language':['English','French','Spanish'],
'Converted':
[data_eng['converted'][data_eng['converted']=='yes'].count(),
data_fren['converted'][data_fren['converted']=='yes'].count(),
data_span['converted'][data_span['converted']=='yes'].count()],
'Not converted':
[data_eng['converted'][data_eng['converted']=='no'].count(),
data_fren['converted'][data_fren['converted']=='no'].count(),
data_span['converted'][data_span['converted']=='no'].count()],
'Sample size':[data_eng.shape[0],data_fren.shape[0],data_span.shape[0]]})
lang_table
|   | Language | Converted | Not converted | Sample size |
|---|---|---|---|---|
| 0 | English | 21 | 11 | 32 |
| 1 | French | 15 | 19 | 34 |
| 2 | Spanish | 18 | 16 | 34 |
# Add a column with the conversion rate for each language
lang_table['Conversion rate'] = round(lang_table['Converted']/lang_table['Sample size'],2)
lang_table
|   | Language | Converted | Not converted | Sample size | Conversion rate |
|---|---|---|---|---|---|
| 0 | English | 21 | 11 | 32 | 0.66 |
| 1 | French | 15 | 19 | 34 | 0.44 |
| 2 | Spanish | 18 | 16 | 34 | 0.53 |
# Use a bar plot to visualize conversion rates for each language
sns.catplot(data=lang_table.sort_values('Conversion rate'), x='Language', y='Conversion rate', kind='bar')
plt.title('Bar Plot: Conversion Rate for Preferred Languages');
It appears that the conversion rate is highest for users who prefer English and lowest for users who prefer French. The conversion rate for users who prefer Spanish is in the middle.
Null hypothesis: conversion status and preferred language are independent
Alternate hypothesis: conversion status and preferred language are not independent
Since we want to know whether 2 categorical variables are associated, the chi-square test of independence is appropriate. The samples are random and every expected cell count is at least 5, so the necessary assumptions are met.
A significance level of 0.05 caps the probability of a Type I error (false positive) at 5%.
# Use chi2_contingency function to find the p-value from the contingency table
from scipy.stats import chi2_contingency
chi, p_value, dof, expected = chi2_contingency(lang_table.drop(['Language','Sample size','Conversion rate'], axis = 1))
print(f'The p-value is {p_value}')
The p-value is 0.2129888748754345
Since the p-value is greater than the level of significance (0.05), there is NOT enough evidence to reject the null hypothesis. We therefore cannot conclude that conversion depends on preferred language.
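The expected-count assumption (all expected cell counts at least 5) can be confirmed from the expected-frequency table that `chi2_contingency` returns; a sketch using the observed counts from the contingency table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed converted / not-converted counts (rows: English, French, Spanish)
observed = np.array([[21, 11],
                     [15, 19],
                     [18, 16]])

chi2, p, dof, expected = chi2_contingency(observed)

# All expected counts >= 5, so the chi-square approximation is reasonable
print((expected >= 5).all())  # True
```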
# Use indexing of separate DataFrames by language with .mean() method to find average time spent on the new landing page
eng_time = round(data_eng[data_eng["landing_page"]=="new"]["time_spent_on_the_page"].mean(),2)
fren_time = round(data_fren[data_fren["landing_page"]=="new"]["time_spent_on_the_page"].mean(),2)
span_time = round(data_span[data_span["landing_page"]=="new"]["time_spent_on_the_page"].mean(),2)
print(f'The mean time spent on the new page by users preferring English was {eng_time} minutes')
print(f'The mean time spent on the new page by users preferring French was {fren_time} minutes')
print(f'The mean time spent on the new page by users preferring Spanish was {span_time} minutes')
The mean time spent on the new page by users preferring English was 6.66 minutes
The mean time spent on the new page by users preferring French was 6.2 minutes
The mean time spent on the new page by users preferring Spanish was 5.84 minutes
# Create a dataframe with mean time spent on new page by preferred language for visualization
time_df = pd.DataFrame({'Language':['English','French','Spanish'],'Mean Time':[eng_time,fren_time,span_time]})
time_df
|   | Language | Mean Time |
|---|---|---|
| 0 | English | 6.66 |
| 1 | French | 6.20 |
| 2 | Spanish | 5.84 |
# Use a bar plot to visualize mean time spent on new page for each language
sns.catplot(data=time_df.sort_values('Mean Time'), x='Language', y='Mean Time', kind='bar')
plt.title('Bar Plot: Mean Time Spent on New Page')
plt.ylabel('Mean time (minutes)');
The time spent on the new page appears to be highest for users who prefer English and lowest for users who prefer Spanish. The time spent by users who prefer French is in the middle.
Null hypothesis: the mean time spent on the new page is equal across the different preferred languages
Alternate hypothesis: the mean time spent on the new page is different across different preferred languages
Since we are comparing sample means across more than 2 independent groups, a one-way ANOVA test may be appropriate. The Shapiro-Wilk and Levene's tests can be used to verify the assumptions of normality and equal variances (respectively) for the ANOVA test.
For the Shapiro-Wilk test of normality, the null hypothesis is that time spent on the new page follows a normal distribution. The alternate hypothesis is that time spent on the new page does not follow a normal distribution.
w, p_value = stats.shapiro(data_new['time_spent_on_the_page'])
print(f'The p-value is {p_value}')
The p-value is 0.8040016293525696
The p-value for the Shapiro-Wilk test is greater than 0.05, so we fail to reject the null and therefore maintain that time spent on the new page follows a normal distribution, satisfying the assumption for the ANOVA test.
For Levene's test, the null hypothesis is that all the population variances are equal. The alternate hypothesis is that the population variances are unequal.
from scipy.stats import levene
statistic, p_value = levene(
data_new['time_spent_on_the_page'][data_new['language_preferred']=="English"],
data_new['time_spent_on_the_page'][data_new['language_preferred']=="French"],
data_new['time_spent_on_the_page'][data_new['language_preferred']=="Spanish"])
print(f"The p-value for Levene's test is {p_value}")
The p-value for Levene's test is 0.46711357711340173
The p-value for Levene's test is greater than the level of significance (0.05), therefore the null cannot be rejected and we maintain that the population variances are equal, which satisfies the assumption for the ANOVA test.
# Perform the ANOVA test using f_oneway function
from scipy.stats import f_oneway
test_stat, p_value = f_oneway(
data_new.loc[data_new['language_preferred'] == 'English', 'time_spent_on_the_page'],
data_new.loc[data_new['language_preferred'] == 'French', 'time_spent_on_the_page'],
data_new.loc[data_new['language_preferred'] == 'Spanish', 'time_spent_on_the_page'])
print(f'The p-value for the ANOVA test is {p_value}')
The p-value for the ANOVA test is 0.43204138694325955
The p-value for the ANOVA test is greater than the level of significance (0.05) which means there is not enough evidence to reject the null hypothesis. Therefore it is maintained that the time spent on the new page is equal across the different preferred languages.
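The three-step procedure above (normality check, equal-variance check, then ANOVA) can be sketched end-to-end on synthetic data; the group sizes and distribution parameters below are illustrative, not taken from the experiment.

```python
import numpy as np
from scipy.stats import shapiro, levene, f_oneway

rng = np.random.default_rng(42)
# Three groups drawn from the same normal distribution (equal true means).
groups = [rng.normal(loc=6.0, scale=1.5, size=80) for _ in range(3)]

_, p_norm = shapiro(np.concatenate(groups))   # normality of pooled sample
_, p_var = levene(*groups)                    # equality of variances
_, p_anova = f_oneway(*groups)                # equality of means

print(f"Shapiro p={p_norm:.3f}, Levene p={p_var:.3f}, ANOVA p={p_anova:.3f}")
```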
Summary of evidence:
Recommendations:
Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.
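As a quick sanity check on the forecast arithmetic (using the rounded figures quoted above, not IDC's exact numbers), the implied 2018 market size can be backed out of the 2023 value and the CAGR:

```python
# Back out the implied 2018 market size from the 2023 forecast and the CAGR.
# Inputs are the rounded figures quoted above, not exact IDC numbers.
forecast_2023 = 52.7   # $bn
cagr = 0.136           # 13.6% per year
years = 5              # 2018 -> 2023

implied_2018 = forecast_2023 / (1 + cagr) ** years
print(f"Implied 2018 market size: ${implied_2018:.1f}bn")  # roughly $27.9bn
```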
Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
# Import libraries for data manipulation.
import pandas as pd
import numpy as np
# Import libraries for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Import libraries for linear regression.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pylab
import scipy.stats as stats
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
# set decimal places to 2
pd.set_option('display.float_format', lambda x: '%.2f' % x)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
data = pd.read_csv('/content/drive/My Drive/Data Science/Current/used_device_data.csv')
# Use .shape attribute to display the number of rows & columns in the data set.
data.shape
(3454, 15)
The data set has 3454 rows and 15 columns.
# Display a sample of 5 rows from the data set to get a general idea of the information & make sure it is loaded properly.
data.sample(5)
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 472 | Alcatel | Android | 15.24 | yes | no | 8.00 | 5.00 | 16.00 | 4.00 | 3000.00 | 155.00 | 2018 | 586 | 3.95 | 5.07 |
| 719 | Asus | Android | 18.01 | yes | no | 13.00 | 5.00 | 64.00 | 4.00 | 4680.00 | 168.00 | 2017 | 847 | 4.84 | 5.70 |
| 1719 | LG | Android | 12.70 | no | no | 8.00 | 1.00 | 32.00 | 4.00 | 2540.00 | 137.00 | 2014 | 646 | 3.82 | 4.62 |
| 696 | Others | Android | 10.29 | yes | no | 5.00 | 0.30 | 16.00 | 4.00 | 1850.00 | 140.00 | 2014 | 691 | 3.82 | 5.44 |
| 1358 | Huawei | Android | 17.78 | yes | no | 13.00 | 5.00 | 16.00 | 4.00 | 5000.00 | 239.00 | 2015 | 700 | 4.59 | 5.92 |
# Use .info() method to check the non-null counts and data types of each column.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
Columns with object/string (categorical) type: brand name, os, 4g, 5g
Columns with float/int (numeric) type: screen size, main camera MP, selfie camera MP, internal memory, RAM, battery, weight, release year, days used, normalized used price, normalized new price
There are several columns with missing (null) values - these will be addressed later.
# Check the statistical summary for the numeric columns, dropping release_year since summary statistics on a year label are not meaningful.
data.describe().T.drop('release_year')
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.00 | 13.71 | 3.81 | 5.08 | 12.70 | 12.83 | 15.34 | 30.71 |
| main_camera_mp | 3275.00 | 9.46 | 4.82 | 0.08 | 5.00 | 8.00 | 13.00 | 48.00 |
| selfie_camera_mp | 3452.00 | 6.55 | 6.97 | 0.00 | 2.00 | 5.00 | 8.00 | 32.00 |
| int_memory | 3450.00 | 54.57 | 84.97 | 0.01 | 16.00 | 32.00 | 64.00 | 1024.00 |
| ram | 3450.00 | 4.04 | 1.37 | 0.02 | 4.00 | 4.00 | 4.00 | 12.00 |
| battery | 3448.00 | 3133.40 | 1299.68 | 500.00 | 2100.00 | 3000.00 | 4000.00 | 9720.00 |
| weight | 3447.00 | 182.75 | 88.41 | 69.00 | 142.00 | 160.00 | 185.00 | 855.00 |
| days_used | 3454.00 | 674.87 | 248.58 | 91.00 | 533.50 | 690.50 | 868.75 | 1094.00 |
| normalized_used_price | 3454.00 | 4.36 | 0.59 | 1.54 | 4.03 | 4.41 | 4.76 | 6.62 |
| normalized_new_price | 3454.00 | 5.23 | 0.68 | 2.90 | 4.79 | 5.25 | 5.67 | 7.85 |
Normalized used price will be the dependent variable. Its mean is 4.36 and its median is 4.41 (these values are on the normalized scale, not raw euro amounts). Normalized used prices range from 1.54 to 6.62.
# Check the statistical summary for the categorical variables.
data.describe(include='object').T
| | count | unique | top | freq |
|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 |
| os | 3454 | 4 | Android | 3214 |
| 4g | 3454 | 2 | yes | 2335 |
| 5g | 3454 | 2 | no | 3302 |
There are 34 different brand names and 4 different operating systems included in the data set.
# Check for duplicated values.
data.duplicated().sum()
0
There are no duplicated values in the data set.
# Use a histogram to visualize the distribution of a numeric variable.
plt.title('Histogram: Normalized Used Price')
plt.xlabel('Normalized Used Price (euros)')
sns.histplot(data, x='normalized_used_price',kde=True);
The distribution of normalized used device prices appears close to normal, perhaps slightly left-skewed.
# Use a boxplot to visualize distribution and outliers.
plt.title('Boxplot: Normalized Used Price')
sns.boxplot(data=data,x='normalized_used_price')
plt.xlabel('Normalized Used Price (euros)');
The boxplot confirms that the distribution is slightly skewed left with a significant number of outliers on the lower end and a moderate amount of outliers on the upper end.
# Use countplot to visualize the distribution of operating systems.
plt.title('Countplot: Operating Systems')
sns.countplot(data=data, x='os', order=data['os'].value_counts().index)
plt.xlabel('OS')
plt.ylabel('Count');
Android is by far the most common OS while iOS is the least common.
# Divide the number of rows where os=Android by the total number of rows & multiply by 100 to get percent.
data[data['os']=='Android'].shape[0]/data.shape[0]*100
93.05153445280834
Android devices account for about 93% of the used phones in the data set.
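The same share can be read directly from `value_counts(normalize=True)`; a minimal sketch on a toy frame (the counts here are illustrative, not the real data):

```python
import pandas as pd

# Toy frame standing in for the full data set.
toy = pd.DataFrame({'os': ['Android'] * 93 + ['iOS'] * 4 + ['Windows'] * 3})

# normalize=True returns proportions rather than raw counts.
os_share = toy['os'].value_counts(normalize=True) * 100
print(os_share)
```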
# Use boxplot to visualize variations in RAM across different brands.
plt.figure(figsize=(20, 5))
plt.title('Boxplot: RAM by Brand')
sns.boxplot(data=data, x="brand_name", y="ram", order=data.groupby('brand_name')['ram'].median().sort_values().index)
plt.xticks(rotation=45)
plt.xlabel('Brand name')
plt.ylabel('RAM (GB)');
The median RAM across the significant majority of brands is 4 GB. OnePlus has the highest median RAM while Celkon has the lowest.
# Use .groupby() method to determine which brands have the highest and lowest average RAM.
brand_ram = data.groupby(['brand_name'])['ram'].mean().reset_index()
brand_ram = brand_ram.sort_values('ram', ascending=False)
with pd.option_context('display.max_rows',6):
print(brand_ram)
   brand_name  ram
22    OnePlus 6.36
23       Oppo 4.96
30       Vivo 4.76
..        ...  ...
12    Infinix 2.60
21      Nokia 2.42
5      Celkon 1.61

[34 rows x 2 columns]
The brands with the highest mean RAM are OnePlus, Oppo and Vivo.
The brands with the lowest mean RAM are Celkon, Nokia and Infinix.
# Use a scatterplot with best-fit line to visualize the relationship between battery capacity and weight for high-capacity batteries.
sns.lmplot(data=data[data['battery']>4500],x='battery',y='weight')
plt.title('Scatterplot: Battery Capacity vs. Weight')
plt.xlabel('Battery capacity (mAh)')
plt.ylabel('Weight (grams)');
It appears that phones with higher capacity batteries tend to weigh more, as expected.
# Calculate the percentage of phones with battery capacity >4500 mAh.
data[data['battery']>4500].shape[0]/data.shape[0]*100
9.872611464968154
Around 10% of used phones have a high capacity battery.
# Visualize the variations in weight for phones with high capacity batteries across different brands.
plt.figure(figsize=(20,5))
sns.boxplot(data=data[data['battery']>4500], x='brand_name', y='weight', order=data[data['battery']>4500].groupby('brand_name')['weight'].median().sort_values().index)
plt.xticks(rotation=45)
plt.title('Boxplot: Weight by Brand')
plt.xlabel('Brand')
plt.ylabel('Weight (grams)');
There are significant variations in weight across brands for phones with high capacity batteries. Google and Lenovo phones with high capacity batteries have the highest median weight while Micromax phones have the lowest median weight.
# Create a new DataFrame with a column converting screen size from centimeters to inches, then filter for screen size >6 inches to get the percentage of used phones with large screens.
data_in = data.copy()
data_in['screen_in'] = data_in['screen_size']*0.393701
data_in[data_in['screen_in']>6].shape[0]/data.shape[0]*100
35.552982049797336
35.6% of all used phones have a screen size over 6 inches.
# Use a countplot to visualize the number of large-screen phones available across different brands.
plt.figure(figsize=(20,5))
sns.countplot(data=data_in[data_in['screen_in']>6], x='brand_name', order=data_in[data_in['screen_in']>6]['brand_name'].value_counts().index)
plt.xticks(rotation=45)
plt.title('Countplot: Large Screens by Brand')
plt.xlabel('Brand')
plt.ylabel('Count (screen size > 6 in)');
Huawei and Samsung offer the highest number of large-screen phones, while Microsoft, Spice and Karbonn offer the fewest.
# Create a new DataFrame to store the count and percentage of phones with screen size over 6 inches by brand.
large = data_in[data_in['screen_in']>6]['brand_name'].value_counts().rename('large').reset_index()
totals = data['brand_name'].value_counts().rename('total').reset_index()
scrn_pct = pd.merge(large,totals)
scrn_pct.rename(columns={'index':'brand'}, inplace=True)
scrn_pct['pct_large'] = scrn_pct['large']/scrn_pct['total']*100
scrn_pct = scrn_pct.sort_values(by='pct_large', ascending=False)
with pd.option_context('display.max_rows',10):
print(scrn_pct)
        brand  large  total  pct_large
11     Realme     41     41     100.00
22    Infinix     10     10     100.00
19    OnePlus     16     22      72.73
3        Vivo     82    117      70.09
0      Huawei    157    251      62.55
..        ...    ...    ...        ...
25       XOLO      4     49       8.16
28    Karbonn      2     29       6.90
29      Spice      2     30       6.67
26  Panasonic      3     47       6.38
30  Microsoft      1     22       4.55

[31 rows x 4 columns]
100% of Realme and Infinix brand phones have a screen size over 6 inches. OnePlus brand has the 3rd highest percentage of phones with large screen size at about 73%. The brands with the lowest percentage of phones with large screen size are Spice, Panasonic and Microsoft.
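The merge-based percentage table above can also be computed in one pass by taking the group-wise mean of a boolean flag; a sketch on toy data (brand names and sizes are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'brand_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'screen_in':  [6.5, 5.9, 6.2, 5.5, 6.1, 5.0],
})

# The mean of a boolean column is the fraction of True values per group.
pct_large = (toy['screen_in'] > 6).groupby(toy['brand_name']).mean() * 100
print(pct_large.sort_values(ascending=False))
```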
# Use .shape attribute and logical indexing to calculate percentage of phones with >8 MP selfie cameras.
data[data['selfie_camera_mp']>8].shape[0]/data.shape[0]*100
18.963520555877245
19% of all used phones have a selfie camera with over 8 MP.
# Use a countplot to visualize the number of phones available across different brands with high selfie camera MP.
plt.figure(figsize=(20,5))
sns.countplot(data=data[data['selfie_camera_mp']>8], x='brand_name', order=data[data['selfie_camera_mp']>8]['brand_name'].value_counts().index)
plt.title('Countplot: High Selfie Camera MP by Brand')
plt.xlabel('Brand')
plt.ylabel('Count (selfie camera MP > 8)')
plt.xticks(rotation=45);
Huawei, Vivo and Oppo offer the most phones with high selfie camera MP.
Acer, Micromax and Panasonic offer the fewest phones with high selfie camera MP.
# Create a new DataFrame to store the count and percentage of phones with selfie camera MP >8 by brand.
high = data[data['selfie_camera_mp']>8]['brand_name'].value_counts().rename('over_8').reset_index()
total = data['brand_name'].value_counts().rename('total').reset_index()
sMP_pct = pd.merge(high,total)
sMP_pct.rename(columns={'index':'brand'}, inplace=True)
sMP_pct['pct_over8'] = sMP_pct['over_8']/sMP_pct['total']*100
sMP_pct = sMP_pct.sort_values(by='pct_over8', ascending=False)
with pd.option_context('display.max_rows',10):
print(sMP_pct)
        brand  over_8  total  pct_over8
13    OnePlus      18     22      81.82
1        Vivo      78    117      66.67
2        Oppo      75    129      58.14
3      Xiaomi      63    132      47.73
12     Realme      18     41      43.90
..        ...     ...    ...        ...
6      Others      34    502       6.77
17       Asus       6    122       4.92
23  Panasonic       2     47       4.26
24       Acer       1     51       1.96
22   Micromax       2    117       1.71

[25 rows x 4 columns]
The brands offering the greatest percentage of phones with selfie camera MP over 8 are OnePlus, Vivo, Oppo, Xiaomi and Realme. The brands with the lowest percentage of phones with selfie camera MP over 8 are Asus, Panasonic, Acer and Micromax.
# Use correlation heatmap to visualize which variables have the highest correlation with normalized used price.
plt.figure(figsize=(12,7))
sns.heatmap(data=data.select_dtypes(include='number').drop('release_year', axis=1).corr().sort_values('normalized_used_price', ascending=False), annot=True)
plt.title('Heatmap: Correlation between numeric variables');
Normalized used price has the strongest correlations with normalized new price, battery, selfie camera MP and screen size.
# Display pairplot to visualize the relationships between the variables most correlated with normalized used price.
sns.pairplot(data=data[['normalized_used_price','normalized_new_price','battery','selfie_camera_mp','screen_size']]);
There appears to be a nearly-linear relationship between used price and new price. The relationships between used price and other variables appear more complex.
# Use a histogram to visualize the distribution of screen size (numeric variable).
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='screen_size', kde=True, bins=25)
plt.title('Histogram: Screen Size')
plt.xlabel('Screen Size (cm)');
Screen size has a non-normal distribution with a right-skewed tail.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='screen_size')
plt.title('Boxplot: Screen Size')
plt.xlabel('Screen Size (cm)');
The boxplot confirms that screen size data is skewed right with a significant number of outliers on both the high and low ends.
# Use histogram to visualize the distribution of main camera MP.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='main_camera_mp', kde=True)
plt.title('Histogram: Main Camera MP')
plt.xlabel('Main Camera MP');
The distribution of main camera MP is non-normal with a long right tail.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='main_camera_mp')
plt.title('Boxplot: Main Camera MP')
plt.xlabel('Main Camera MP');
The boxplot confirms that main camera MP is right skewed with several outliers on the high end.
# Use histogram to check distribution of selfie camera MP.
plt.figure(figsize=(20,5))
plt.title('Histogram: Selfie Camera MP')
sns.histplot(data=data, x='selfie_camera_mp', kde=True, bins=10)
plt.xlabel('Selfie Camera MP');
The distribution of selfie camera MP is non-normal with a long right tail.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
plt.title('Boxplot: Selfie Camera MP')
sns.boxplot(data=data, x='selfie_camera_mp')
plt.xlabel('Selfie Camera MP');
The boxplot confirms that selfie camera MP is right skewed with several outliers on the high end.
# Use histogram to visualize distribution of internal memory.
plt.figure(figsize=(20,5))
plt.title('Histogram: Internal Memory')
sns.histplot(data=data, x='int_memory', kde=True, bins=25)
plt.xlabel('Internal Memory (GB)');
The distribution of internal memory is non-normal and highly right skewed.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
plt.title('Boxplot: Internal Memory')
sns.boxplot(data=data, x='int_memory')
plt.xlabel('Internal Memory (GB)');
The boxplot confirms that the distribution of internal memory is right skewed and has several outliers on the high end.
# Use countplot to visualize internal memory as values are semi-discrete.
plt.figure(figsize=(20,5))
sns.countplot(data=data, x='int_memory')
plt.title('Countplot: Internal Memory')
plt.xlabel('Internal Memory (GB)')
plt.ylabel('Count');
The countplot shows that 16 GB is the most common amount of internal memory for used phones in the data set.
# Use histogram to visualize the distribution of RAM.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='ram',kde=True)
plt.title('Histogram: RAM')
plt.xlabel('RAM (GB)');
The distribution of RAM has a sharp peak at 4 GB and long tails on both ends, skewed right.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='ram')
plt.title('Boxplot: RAM')
plt.xlabel('RAM (GB)');
The boxplot confirms that there are significant outliers on the high and low ends.
# Use countplot to visualize RAM as values are semi-discrete.
plt.figure(figsize=(20,5))
sns.countplot(data=data, x='ram')
plt.title('Countplot: RAM')
plt.xlabel('RAM (GB)')
plt.ylabel('Count');
By far the most common RAM is 4.0 GB.
# Use histogram to visualize the distribution of battery capacity.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='battery', kde=True)
plt.title('Histogram: Battery Capacity')
plt.xlabel('Battery Capacity (mAh)');
The distribution of battery capacity is non-normal and highly right skewed.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='battery')
plt.title('Boxplot: Battery Capacity')
plt.xlabel('Battery Capacity (mAh)');
The boxplot confirms that the distribution is right skewed with a significant number of outliers on the high end.
# Use histogram to visualize the distribution of weight.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='weight', kde=True)
plt.title('Histogram: Weight')
plt.xlabel('Weight (grams)');
The distribution of weight appears close to normal but with a long right tail.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='weight')
plt.title('Boxplot: Weight')
plt.xlabel('Weight (grams)');
The boxplot confirms that there is a significant number of outliers on the high end and a few outliers on the low end.
# Use histogram to visualize the distribution of days used.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='days_used', kde=True)
plt.title('Histogram: Days Used')
plt.xlabel('Days Used');
The distribution of days used is not normal and somewhat skewed left.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='days_used')
plt.title('Boxplot: Days Used')
plt.xlabel('Days Used');
The boxplot confirms the slight left skewness of the distribution. There are no outliers.
# Use histogram to visualize the distribution of normalized new price.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='normalized_new_price', kde=True, bins=30)
plt.title('Histogram: Normalized New Price')
plt.xlabel('Normalized New Price (euros)');
The distribution of normalized new price appears close to normal.
# Use boxplot to further visualize distribution and outliers.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='normalized_new_price')
plt.title('Boxplot: Normalized New Price')
plt.xlabel('Normalized New Price (euros)');
The boxplot shows that there is a significant number of outliers on both the high and low ends.
# Use countplot to visualize the distribution of release year.
plt.figure(figsize=(20,5))
sns.countplot(data=data, x='release_year')
plt.title('Countplot: Release Year')
plt.xlabel('Release Year')
plt.ylabel('Count');
The greatest number of used phones were released in 2014, while the least were released in 2020.
# Use a lineplot to visualize the change in normalized used price based on release year.
plt.figure(figsize=(20,5))
sns.lineplot(data=data, x='release_year',y='normalized_used_price')
plt.title('Lineplot: Normalized Used Price by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Normalized Used Price (euros)');
Normalized used price tends to increase for more recent release years, though prices are similar for phones released in 2018, 2019 and 2020.
# Use countplot to visualize the distribution of brand names.
plt.figure(figsize=(20,5))
sns.countplot(data=data, x='brand_name', order=data['brand_name'].value_counts().index)
plt.xticks(rotation=45)
plt.title('Countplot: Brand Name')
plt.xlabel('Brand Name')
plt.ylabel('Count');
Samsung, Huawei and LG offer the largest number of used phones while OnePlus, Google and Infinix offer the least.
# Use countplot to visualize the distribution of 4G devices.
sns.countplot(data=data, x='4g')
plt.title('Countplot: 4G')
plt.xlabel('4G')
plt.ylabel('Count');
About twice as many phones have 4G availability compared to those that do not.
# Use boxplot to visualize the variation in normalized used price between phones with and without 4G.
sns.boxplot(data=data, x='4g', y='normalized_used_price')
plt.title('Boxplot: Normalized Used Price by 4G')
plt.xlabel('4G')
plt.ylabel('Normalized Used Price (euros)');
It appears that the median used price for phones with 4G is higher than the used price for those without 4G.
# Use countplot to visualize the distribution of 5G devices.
sns.countplot(data=data, x='5g', order=['yes','no'])
plt.title('Countplot: 5G')
plt.xlabel('5G')
plt.ylabel('Count');
The vast majority of used phones do not have 5G available.
# Use boxplot to visualize the variation in normalized used price between phones with and without 5G.
sns.boxplot(data=data, x='5g', y='normalized_used_price', order=['yes','no'])
plt.title('Boxplot: Normalized Used Price by 5G')
plt.xlabel('5G')
plt.ylabel('Normalized Used Price (euros)');
It appears that the median price of used phones with 5G is significantly higher than the used price of phones without 5G.
# Display the count and percentage of missing values in each column.
missing = data.isnull().sum()
missing = missing[missing > 0]
pd.DataFrame({'Count': missing, 'Percentage': missing / data.shape[0] * 100})
| | Count | Percentage |
|---|---|---|
| main_camera_mp | 179 | 5.18 |
| selfie_camera_mp | 2 | 0.06 |
| int_memory | 4 | 0.12 |
| ram | 4 | 0.12 |
| battery | 6 | 0.17 |
| weight | 7 | 0.20 |
Main camera MP has the highest number of missing values at about 5%.
Selfie camera MP, internal memory, RAM, battery and weight each have a small number of missing values.
Because there are so many outliers in the data, median will be the preferred central tendency for imputation.
# Create a copy of the data to deal with missing values.
data2 = data.copy()
# Replace null values for main camera MP with the median main camera MP for phones of the same brand and release year.
data2['main_camera_mp'] = data2['main_camera_mp'].fillna(value = data2.groupby(['brand_name','release_year'])['main_camera_mp'].transform('median'))
# Check for remaining null values in the main camera MP column.
data2.loc[data2['main_camera_mp'].isnull()==True]
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Infinix | Android | 17.32 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 245 | 4.28 | 4.60 |
| 60 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.36 | 4.71 |
| 61 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 256 | 4.18 | 4.51 |
| 62 | Infinix | Android | 15.39 | yes | no | NaN | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 316 | 4.56 | 4.60 |
| 63 | Infinix | Android | 15.29 | yes | no | NaN | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 468 | 4.42 | 4.87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3411 | Realme | Android | 15.34 | yes | no | NaN | 16.00 | 64.00 | 4.00 | 4000.00 | 183.00 | 2019 | 503 | 4.57 | 5.16 |
| 3412 | Realme | Android | 15.32 | yes | no | NaN | 16.00 | 64.00 | 4.00 | 4035.00 | 184.00 | 2019 | 433 | 4.52 | 5.07 |
| 3413 | Realme | Android | 15.32 | yes | no | NaN | 25.00 | 64.00 | 4.00 | 4045.00 | 172.00 | 2019 | 288 | 4.78 | 4.97 |
| 3448 | Asus | Android | 16.74 | yes | no | NaN | 24.00 | 128.00 | 8.00 | 6000.00 | 240.00 | 2019 | 325 | 5.72 | 7.06 |
| 3449 | Asus | Android | 15.34 | yes | no | NaN | 8.00 | 64.00 | 6.00 | 5000.00 | 190.00 | 2019 | 232 | 4.49 | 6.48 |
179 rows × 15 columns
# Some of the values couldn't be imputed by brand name and release year, so they will be imputed by brand name alone.
data2['main_camera_mp'] = data2['main_camera_mp'].fillna(value = data2.groupby(['brand_name'])['main_camera_mp'].transform('median'))
# Check for remaining null values in the main camera MP column.
data2.loc[data2['main_camera_mp'].isnull()==True]
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Infinix | Android | 17.32 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 245 | 4.28 | 4.60 |
| 60 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.36 | 4.71 |
| 61 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 256 | 4.18 | 4.51 |
| 62 | Infinix | Android | 15.39 | yes | no | NaN | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 316 | 4.56 | 4.60 |
| 63 | Infinix | Android | 15.29 | yes | no | NaN | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 468 | 4.42 | 4.87 |
| 278 | Infinix | Android | 17.32 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 6000.00 | 209.00 | 2020 | 320 | 4.41 | 4.61 |
| 279 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 64.00 | 4.00 | 5000.00 | 185.00 | 2020 | 173 | 4.50 | 4.70 |
| 280 | Infinix | Android | 15.39 | yes | no | NaN | 8.00 | 32.00 | 2.00 | 5000.00 | 185.00 | 2020 | 329 | 4.37 | 4.49 |
| 281 | Infinix | Android | 15.39 | yes | no | NaN | 16.00 | 32.00 | 3.00 | 4000.00 | 178.00 | 2019 | 356 | 4.42 | 4.61 |
| 282 | Infinix | Android | 15.29 | yes | no | NaN | 16.00 | 32.00 | 2.00 | 4000.00 | 165.00 | 2019 | 497 | 4.42 | 4.87 |
# None of the Infinix brand phones have values for main camera MP, so these values cannot be imputed by brand.
# Instead the values can be imputed by release year.
data2['main_camera_mp'] = data2['main_camera_mp'].fillna(value = data2.groupby(['release_year'])['main_camera_mp'].transform('median'))
# Verify that there are no remaining null values in the main camera MP column.
data2['main_camera_mp'].isnull().sum()
0
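The brand-plus-year, brand-only, year-only fallback used above can be written as a single loop over successively coarser groupings; a sketch on toy data with hypothetical column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'brand': ['A', 'A', 'A', 'B', 'B'],
    'year':  [2019, 2019, 2020, 2019, 2020],
    'mp':    [8.0, np.nan, np.nan, np.nan, np.nan],
})

# Impute with the group median, falling back to coarser groupings,
# then to the overall median for anything still missing.
for keys in (['brand', 'year'], ['brand'], ['year']):
    toy['mp'] = toy['mp'].fillna(toy.groupby(keys)['mp'].transform('median'))
toy['mp'] = toy['mp'].fillna(toy['mp'].median())

print(toy['mp'].tolist())  # all NaNs resolved
```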
# Impute the remaining null values using median values grouped by brand & release year.
data2['selfie_camera_mp'] = data2['selfie_camera_mp'].fillna(value = data2.groupby(['brand_name','release_year'])['selfie_camera_mp'].transform('median'))
data2['int_memory'] = data2['int_memory'].fillna(value = data2.groupby(['brand_name','release_year'])['int_memory'].transform('median'))
data2['ram'] = data2['ram'].fillna(value = data2.groupby(['brand_name','release_year'])['ram'].transform('median'))
data2['battery'] = data2['battery'].fillna(value = data2.groupby(['brand_name','release_year'])['battery'].transform('median'))
data2['weight'] = data2['weight'].fillna(value = data2.groupby(['brand_name','release_year'])['weight'].transform('median'))
# Check if there are any remaining null values.
data2.isnull().sum()
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         2
int_memory               0
ram                      0
battery                  6
weight                   7
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64
# The remaining null values can be imputed by brand alone.
data2['selfie_camera_mp'] = data2['selfie_camera_mp'].fillna(value = data2.groupby('brand_name')['selfie_camera_mp'].transform('median'))
data2['battery'] = data2['battery'].fillna(value = data2.groupby(['brand_name'])['battery'].transform('median'))
data2['weight'] = data2['weight'].fillna(value = data2.groupby(['brand_name'])['weight'].transform('median'))
# Verify that there are no remaining null values.
data2.isnull().sum()
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64
# Replace release_year with a new column for years_since_release to allow more meaningful statistical analysis.
data2['years_since_release'] = 2022 - data2['release_year']
data2.drop('release_year', axis=1, inplace=True)
# Check the statistical summary of the new column.
data2['years_since_release'].describe()
count   3454.00
mean       6.03
std        2.30
min        2.00
25%        4.00
50%        6.50
75%        8.00
max        9.00
Name: years_since_release, dtype: float64
The average years since release is around 6. Years since release ranges from 2 to 9 years.
# Create a variable for the numeric columns.
numeric_columns = ['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight', 'days_used', 'normalized_used_price', 'normalized_new_price','years_since_release']
# Display boxplots for each numeric variable to visualize outliers.
plt.figure(figsize=(20,10))
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(data2[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
# Double-check the statistical summary to determine if maximum and minimum values are reasonable based on domain knowledge.
data2.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.00 | 13.71 | 3.81 | 5.08 | 12.70 | 12.83 | 15.34 | 30.71 |
| main_camera_mp | 3454.00 | 9.63 | 4.75 | 0.08 | 5.00 | 8.10 | 13.00 | 48.00 |
| selfie_camera_mp | 3454.00 | 6.56 | 6.97 | 0.00 | 2.00 | 5.00 | 8.00 | 32.00 |
| int_memory | 3454.00 | 54.53 | 84.93 | 0.01 | 16.00 | 32.00 | 64.00 | 1024.00 |
| ram | 3454.00 | 4.03 | 1.37 | 0.02 | 4.00 | 4.00 | 4.00 | 12.00 |
| battery | 3454.00 | 3132.58 | 1298.88 | 500.00 | 2100.00 | 3000.00 | 4000.00 | 9720.00 |
| weight | 3454.00 | 182.64 | 88.36 | 69.00 | 142.00 | 160.00 | 185.00 | 855.00 |
| days_used | 3454.00 | 674.87 | 248.58 | 91.00 | 533.50 | 690.50 | 868.75 | 1094.00 |
| normalized_used_price | 3454.00 | 4.36 | 0.59 | 1.54 | 4.03 | 4.41 | 4.76 | 6.62 |
| normalized_new_price | 3454.00 | 5.23 | 0.68 | 2.90 | 4.79 | 5.25 | 5.67 | 7.85 |
| years_since_release | 3454.00 | 6.03 | 2.30 | 2.00 | 4.00 | 6.50 | 8.00 | 9.00 |
There are significant outliers in the data, but since they represent genuine values rather than data errors, they should not be treated.
# Compare the statistical summary for the processed data set to that of the original data set.
data2.describe().T.drop(
    ['days_used', 'normalized_new_price', 'normalized_used_price', 'screen_size', 'years_since_release']
) / data.describe().T.drop(
    ['days_used', 'normalized_new_price', 'normalized_used_price', 'screen_size', 'release_year']
)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| main_camera_mp | 1.05 | 1.02 | 0.99 | 1.00 | 1.00 | 1.01 | 1.00 | 1.00 |
| selfie_camera_mp | 1.00 | 1.00 | 1.00 | NaN | 1.00 | 1.00 | 1.00 | 1.00 |
| int_memory | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ram | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| battery | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| weight | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
It appears that the imputation of values has not significantly affected the data.
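The ratio-of-summaries check above can be sketched on a minimal frame (column name hypothetical): dividing one `describe()` by another compares the two distributions statistic-by-statistic, and values near 1 mean the imputation barely moved them.

```python
import numpy as np
import pandas as pd

before = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})

after = before.copy()
after["x"] = after["x"].fillna(after["x"].median())

# Element-wise ratio of summary statistics, aligned on the describe() index
# (count, mean, std, min, 25%, 50%, 75%, max).
ratio = after.describe() / before.describe()

# count changes (3 -> 4) because a NaN was filled; min and max are untouched.
print(ratio.loc["count", "x"], ratio.loc["max", "x"])
```

Statistics driven by the tails (min, max) stay at exactly 1 under median imputation, while count, mean, and std shift slightly, which mirrors what the notebook observed.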
Let's repeat EDA on main camera MP as it was the column most affected by imputation.
# Use histogram to visualize the distribution of main camera MP for the original data.
plt.figure(figsize=(20,5))
sns.histplot(data=data, x='main_camera_mp', kde=True)
plt.title('Histogram: Main Camera MP')
plt.xlabel('Main Camera MP');
# Use histogram to visualize the distribution of main camera MP for the processed data.
plt.figure(figsize=(20,5))
sns.histplot(data=data2, x='main_camera_mp', kde=True)
plt.title('Histogram: Main Camera MP')
plt.xlabel('Main Camera MP');
The distribution of the processed data set is largely similar to the original data set.
# Use boxplot to further visualize distribution and outliers for the original data set.
plt.figure(figsize=(20,5))
sns.boxplot(data=data, x='main_camera_mp')
plt.title('Boxplot: Main Camera MP')
plt.xlabel('Main Camera MP');
# Use boxplot to further visualize distribution and outliers for the processed data.
plt.figure(figsize=(20,5))
sns.boxplot(data=data2, x='main_camera_mp')
plt.title('Boxplot: Main Camera MP')
plt.xlabel('Main Camera MP');
The boxplot of main camera MP for the processed data is identical to that of the original data set.
Imputation of missing values has not made a noticeable impact on the distribution of the data.
# Define X and y variables in order to predict used phone price.
X = data2.drop(['normalized_used_price'], axis=1)
y = data2['normalized_used_price']
print(X.head())
print()
print(y.head())
  brand_name       os  screen_size   4g   5g  main_camera_mp  \
0      Honor  Android        14.50  yes   no           13.00
1      Honor  Android        17.30  yes  yes           13.00
2      Honor  Android        16.69  yes  yes           13.00
3      Honor  Android        25.50  yes  yes           13.00
4      Honor  Android        15.32  yes   no           13.00

   selfie_camera_mp  int_memory  ram  battery  weight  days_used  \
0              5.00       64.00 3.00  3020.00  146.00        127
1             16.00      128.00 8.00  4300.00  213.00        325
2              8.00      128.00 8.00  4200.00  213.00        162
3              8.00       64.00 6.00  7250.00  480.00        345
4              8.00       64.00 3.00  5000.00  185.00        293

   normalized_new_price  years_since_release
0                  4.72                    2
1                  5.52                    2
2                  5.88                    2
3                  5.63                    2
4                  4.95                    2

0   4.31
1   5.16
2   5.11
3   5.14
4   4.39
Name: normalized_used_price, dtype: float64
# Add the intercept.
X = sm.add_constant(X)
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)
# Create dummy variables for categorical (string/object) columns.
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=['object', 'category']).columns.tolist(),
drop_first=True,
)
X.head()
| | const | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | days_used | normalized_new_price | ... | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00 | 14.50 | 13.00 | 5.00 | 64.00 | 3.00 | 3020.00 | 146.00 | 127 | 4.72 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.00 | 17.30 | 13.00 | 16.00 | 128.00 | 8.00 | 4300.00 | 213.00 | 325 | 5.52 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1.00 | 16.69 | 13.00 | 8.00 | 128.00 | 8.00 | 4200.00 | 213.00 | 162 | 5.88 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1.00 | 25.50 | 13.00 | 8.00 | 64.00 | 6.00 | 7250.00 | 480.00 | 345 | 5.63 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 1.00 | 15.32 | 13.00 | 8.00 | 64.00 | 3.00 | 5000.00 | 185.00 | 293 | 4.95 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 49 columns
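The `get_dummies` call above can be illustrated on a minimal frame (column names hypothetical). `drop_first=True` removes one level per category, which avoids the "dummy variable trap" of perfect collinearity with the intercept.

```python
import pandas as pd

# Minimal frame mixing a numeric and a categorical column.
df = pd.DataFrame({
    "ram": [4, 8, 4],
    "os": ["Android", "iOS", "Android"],
})

# One-hot encode only the object/category columns; the first level of each
# category (here "Android") becomes the implicit baseline.
encoded = pd.get_dummies(
    df,
    columns=df.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)
print(encoded.columns.tolist())  # ['ram', 'os_iOS']
```

In the model summary, each dummy coefficient is then read relative to the dropped baseline level.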
# Split the data into training and testing sets in a 70:30 ratio.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2417
Number of rows in test data = 1037
The number of rows in the training and testing sets confirms that the data was split in a 70:30 ratio.
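The same split can be reproduced on synthetic arrays; fixing `random_state` makes the shuffle deterministic, so repeated runs produce identical train/test partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features (values are arbitrary).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.3 reserves 30% of rows for testing, exactly as in the notebook.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
print(Xtr.shape, Xte.shape)  # (70, 2) (30, 2)
```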
# Start by creating a model that incorporates every single independent variable in the data set.
olsmodel = sm.OLS(y_train, x_train).fit()
print(olsmodel.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.842
Method: Least Squares F-statistic: 268.7
Date: Fri, 07 Oct 2022 Prob (F-statistic): 0.00
Time: 00:24:05 Log-Likelihood: 123.85
No. Observations: 2417 AIC: -149.7
Df Residuals: 2368 BIC: 134.0
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 1.3393 0.072 18.563 0.000 1.198 1.481
screen_size 0.0244 0.003 7.163 0.000 0.018 0.031
main_camera_mp 0.0208 0.002 13.848 0.000 0.018 0.024
selfie_camera_mp 0.0135 0.001 11.997 0.000 0.011 0.016
int_memory 0.0001 6.97e-05 1.651 0.099 -2.16e-05 0.000
ram 0.0230 0.005 4.451 0.000 0.013 0.033
battery -1.689e-05 7.27e-06 -2.321 0.020 -3.12e-05 -2.62e-06
weight 0.0010 0.000 7.480 0.000 0.001 0.001
days_used 4.216e-05 3.09e-05 1.366 0.172 -1.84e-05 0.000
normalized_new_price 0.4311 0.012 35.147 0.000 0.407 0.455
years_since_release -0.0237 0.005 -5.193 0.000 -0.033 -0.015
brand_name_Alcatel 0.0154 0.048 0.323 0.747 -0.078 0.109
brand_name_Apple -0.0038 0.147 -0.026 0.980 -0.292 0.285
brand_name_Asus 0.0151 0.048 0.314 0.753 -0.079 0.109
brand_name_BlackBerry -0.0300 0.070 -0.427 0.669 -0.168 0.108
brand_name_Celkon -0.0468 0.066 -0.707 0.480 -0.177 0.083
brand_name_Coolpad 0.0209 0.073 0.287 0.774 -0.122 0.164
brand_name_Gionee 0.0448 0.058 0.775 0.438 -0.068 0.158
brand_name_Google -0.0326 0.085 -0.385 0.700 -0.199 0.133
brand_name_HTC -0.0130 0.048 -0.270 0.787 -0.108 0.081
brand_name_Honor 0.0317 0.049 0.644 0.520 -0.065 0.128
brand_name_Huawei -0.0020 0.044 -0.046 0.964 -0.089 0.085
brand_name_Infinix 0.0592 0.093 0.634 0.526 -0.124 0.242
brand_name_Karbonn 0.0943 0.067 1.405 0.160 -0.037 0.226
brand_name_LG -0.0132 0.045 -0.291 0.771 -0.102 0.076
brand_name_Lava 0.0332 0.062 0.533 0.594 -0.089 0.155
brand_name_Lenovo 0.0454 0.045 1.004 0.316 -0.043 0.134
brand_name_Meizu -0.0129 0.056 -0.230 0.818 -0.123 0.097
brand_name_Micromax -0.0337 0.048 -0.704 0.481 -0.128 0.060
brand_name_Microsoft 0.0952 0.088 1.078 0.281 -0.078 0.268
brand_name_Motorola -0.0112 0.050 -0.226 0.821 -0.109 0.086
brand_name_Nokia 0.0719 0.052 1.387 0.166 -0.030 0.174
brand_name_OnePlus 0.0709 0.077 0.916 0.360 -0.081 0.223
brand_name_Oppo 0.0124 0.048 0.261 0.794 -0.081 0.106
brand_name_Others -0.0080 0.042 -0.190 0.849 -0.091 0.075
brand_name_Panasonic 0.0563 0.056 1.008 0.314 -0.053 0.166
brand_name_Realme 0.0319 0.062 0.518 0.605 -0.089 0.153
brand_name_Samsung -0.0313 0.043 -0.725 0.469 -0.116 0.053
brand_name_Sony -0.0616 0.050 -1.220 0.223 -0.161 0.037
brand_name_Spice -0.0147 0.063 -0.233 0.816 -0.139 0.109
brand_name_Vivo -0.0154 0.048 -0.318 0.750 -0.110 0.080
brand_name_XOLO 0.0152 0.055 0.277 0.782 -0.092 0.123
brand_name_Xiaomi 0.0869 0.048 1.806 0.071 -0.007 0.181
brand_name_ZTE -0.0057 0.047 -0.121 0.904 -0.099 0.087
os_Others -0.0510 0.033 -1.555 0.120 -0.115 0.013
os_Windows -0.0207 0.045 -0.459 0.646 -0.109 0.068
os_iOS -0.0663 0.146 -0.453 0.651 -0.354 0.221
4g_yes 0.0528 0.016 3.326 0.001 0.022 0.084
5g_yes -0.0714 0.031 -2.268 0.023 -0.133 -0.010
==============================================================================
Omnibus: 223.612 Durbin-Watson: 1.910
Prob(Omnibus): 0.000 Jarque-Bera (JB): 422.275
Skew: -0.620 Prob(JB): 2.01e-92
Kurtosis: 4.630 Cond. No. 1.78e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The R-squared value indicates that the model can explain 84.5% of the variance in the training set.
# Create user-defined functions to compute measures of model performance.
# Define a function to compute adjusted R-squared.
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# Define a function to compute MAPE.
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# Define a function to compute different metrics to check performance of a regression model.
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# Predicting using the independent variables.
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# Create a dataframe of metrics.
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
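A quick hand-check of the helper formulas on made-up numbers confirms they behave as expected; the targets and predictions below are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

targets = np.array([100.0, 200.0, 300.0])
preds   = np.array([110.0, 190.0, 330.0])

# R-squared: 1 - SS_res/SS_tot = 1 - 1100/20000 = 0.945
r2 = r2_score(targets, preds)

# Adjusted R-squared with n=3 observations and k=1 predictor:
# 1 - (1 - 0.945) * (3 - 1) / (3 - 1 - 1) = 0.89
n, k = 3, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# MAPE: mean of (10/100, 10/200, 30/300) * 100 = 8.33%
mape = np.mean(np.abs(targets - preds) / targets) * 100
print(round(mape, 2))  # 8.33
```

Note that adjusted R-squared penalizes extra predictors (the `k` term), which is why it is the more honest metric when comparing models with different numbers of columns.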
# Check model performance on the training set.
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.33 |
# Check model performance on the test set.
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test)
olsmodel_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.18 | 0.84 | 0.83 | 4.50 |
The training R-squared is 0.84, so the model is not underfitting.
The train and test RMSE and MAE are comparable, so the model is not overfitting either.
MAE suggests that the model can predict normalized used price within a mean error of 0.18 on the test data.
A MAPE of 4.5 on the test data means that, on average, predictions are within 4.5% of the actual normalized used price.
# Define a function to check VIF.
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
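As a sanity check on what VIF detects, here is a small synthetic example with one nearly duplicated predictor (all variable names and values are made up). The near-copy pair should show inflated VIFs while the independent column stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1 -> collinear
x3 = rng.normal(size=n)                   # independent predictor

demo = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

# VIF for column i = 1 / (1 - R^2) from regressing column i on the others.
vifs = [variance_inflation_factor(demo.values, i) for i in range(demo.shape[1])]
for name, v in zip(demo.columns, vifs):
    print(f"{name}: {v:.2f}")
```

As in the notebook, the constant's VIF is ignored; it is inflated by construction and says nothing about collinearity among the real predictors.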
# Compute VIF once, then display the columns where VIF is > 5.
vif = checking_vif(x_train)
vif[vif['VIF'] > 5]
| | feature | VIF |
|---|---|---|
| 0 | const | 233.24 |
| 1 | screen_size | 7.68 |
| 7 | weight | 6.40 |
| 12 | brand_name_Apple | 13.06 |
| 21 | brand_name_Huawei | 5.98 |
| 34 | brand_name_Others | 9.71 |
| 37 | brand_name_Samsung | 7.54 |
| 46 | os_iOS | 11.78 |
There are multiple columns with very high VIF values, indicating strong multicollinearity. We will systematically drop numerical columns with VIF > 5 and ignore the VIF values for dummy variables and the constant (intercept).
# Define a function to treat multicollinearity.
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
# Apply the function to the columns with VIF > 5, ignoring the constant and dummy variables.
col_list = ['screen_size', 'weight']
res = treating_multicollinearity(x_train, y_train, col_list)
res
| | col | Adj. R-squared after_dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | screen_size | 0.84 | 0.23 |
| 1 | weight | 0.84 | 0.23 |
# Drop the screen size column.
col_to_drop = 'screen_size'
x_train2 = x_train.loc[:, ~x_train.columns.str.startswith(col_to_drop)]
x_test2 = x_test.loc[:, ~x_test.columns.str.startswith(col_to_drop)]
# Check VIF after dropping screen size.
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif[vif['VIF']>5]
VIF after dropping screen_size
| | feature | VIF |
|---|---|---|
| 0 | const | 206.34 |
| 11 | brand_name_Apple | 13.00 |
| 20 | brand_name_Huawei | 5.98 |
| 33 | brand_name_Others | 9.65 |
| 36 | brand_name_Samsung | 7.52 |
| 45 | os_iOS | 11.68 |
The VIFs for all non-constant, non-dummy variables are now below 5; multicollinearity among the numeric predictors has been addressed.
# Print the summary of the new model.
olsmod2 = sm.OLS(y_train, x_train2).fit()
print(olsmod2.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.842
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 267.7
Date: Fri, 07 Oct 2022 Prob (F-statistic): 0.00
Time: 00:31:00 Log-Likelihood: 97.950
No. Observations: 2417 AIC: -99.90
Df Residuals: 2369 BIC: 178.0
Df Model: 47
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 1.5148 0.069 22.089 0.000 1.380 1.649
main_camera_mp 0.0212 0.002 13.979 0.000 0.018 0.024
selfie_camera_mp 0.0138 0.001 12.128 0.000 0.012 0.016
int_memory 9.551e-05 7.04e-05 1.356 0.175 -4.26e-05 0.000
ram 0.0229 0.005 4.397 0.000 0.013 0.033
battery -4.284e-06 7.13e-06 -0.601 0.548 -1.83e-05 9.7e-06
weight 0.0017 9.22e-05 18.376 0.000 0.002 0.002
days_used 2.773e-05 3.11e-05 0.891 0.373 -3.33e-05 8.88e-05
normalized_new_price 0.4413 0.012 35.841 0.000 0.417 0.465
years_since_release -0.0297 0.005 -6.568 0.000 -0.039 -0.021
brand_name_Alcatel 0.0177 0.048 0.368 0.713 -0.077 0.112
brand_name_Apple 0.0660 0.148 0.445 0.656 -0.225 0.357
brand_name_Asus 0.0013 0.048 0.027 0.978 -0.094 0.096
brand_name_BlackBerry -0.0444 0.071 -0.626 0.531 -0.183 0.095
brand_name_Celkon -0.0518 0.067 -0.773 0.440 -0.183 0.080
brand_name_Coolpad 0.0136 0.074 0.185 0.854 -0.131 0.158
brand_name_Gionee 0.0154 0.058 0.265 0.791 -0.099 0.130
brand_name_Google -0.0587 0.085 -0.687 0.492 -0.226 0.109
brand_name_HTC -0.0321 0.049 -0.660 0.509 -0.127 0.063
brand_name_Honor 0.0352 0.050 0.708 0.479 -0.062 0.133
brand_name_Huawei -0.0089 0.045 -0.199 0.842 -0.097 0.079
brand_name_Infinix 0.0450 0.094 0.477 0.634 -0.140 0.230
brand_name_Karbonn 0.0998 0.068 1.472 0.141 -0.033 0.233
brand_name_LG -0.0325 0.046 -0.712 0.476 -0.122 0.057
brand_name_Lava 0.0276 0.063 0.438 0.661 -0.096 0.151
brand_name_Lenovo 0.0345 0.046 0.756 0.450 -0.055 0.124
brand_name_Meizu -0.0282 0.057 -0.498 0.618 -0.139 0.083
brand_name_Micromax -0.0468 0.048 -0.968 0.333 -0.142 0.048
brand_name_Microsoft 0.0773 0.089 0.866 0.387 -0.098 0.252
brand_name_Motorola -0.0329 0.050 -0.658 0.511 -0.131 0.065
brand_name_Nokia 0.0473 0.052 0.906 0.365 -0.055 0.150
brand_name_OnePlus 0.0684 0.078 0.874 0.382 -0.085 0.222
brand_name_Oppo -0.0006 0.048 -0.012 0.991 -0.095 0.094
brand_name_Others -0.0314 0.042 -0.741 0.459 -0.115 0.052
brand_name_Panasonic 0.0482 0.056 0.855 0.393 -0.062 0.159
brand_name_Realme 0.0147 0.062 0.236 0.813 -0.107 0.137
brand_name_Samsung -0.0458 0.044 -1.049 0.294 -0.131 0.040
brand_name_Sony -0.0776 0.051 -1.523 0.128 -0.178 0.022
brand_name_Spice -0.0407 0.064 -0.638 0.524 -0.166 0.084
brand_name_Vivo -0.0206 0.049 -0.421 0.674 -0.117 0.075
brand_name_XOLO 0.0111 0.055 0.201 0.841 -0.097 0.120
brand_name_Xiaomi 0.0734 0.049 1.511 0.131 -0.022 0.169
brand_name_ZTE -0.0219 0.048 -0.457 0.647 -0.116 0.072
os_Others -0.1345 0.031 -4.339 0.000 -0.195 -0.074
os_Windows -0.0182 0.046 -0.399 0.690 -0.108 0.071
os_iOS -0.1657 0.147 -1.125 0.261 -0.455 0.123
4g_yes 0.0508 0.016 3.167 0.002 0.019 0.082
5g_yes -0.0815 0.032 -2.563 0.010 -0.144 -0.019
==============================================================================
Omnibus: 234.761 Durbin-Watson: 1.907
Prob(Omnibus): 0.000 Jarque-Bera (JB): 440.912
Skew: -0.647 Prob(JB): 1.81e-96
Kurtosis: 4.644 Cond. No. 1.78e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.78e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Adjusted R-squared has dropped only slightly, from 0.842 to 0.838, which shows that dropping screen_size did not have much effect on the model.
# Define the initial list of columns.
predictors = x_train2.copy()
cols = predictors.columns.tolist()
# Set an initial max p-value.
max_p_value = 1
# Use a while loop to drop columns with the highest p-values one-by-one, recalculating all p-values after each drop until none are above 0.05.
while len(cols) > 0:
# defining the train set
x_train_aux = predictors[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['const', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'weight', 'normalized_new_price', 'years_since_release', 'brand_name_Karbonn', 'brand_name_Samsung', 'brand_name_Sony', 'brand_name_Xiaomi', 'os_Others', 'os_iOS', '4g_yes', '5g_yes']
# Create new training and testing sets with high p-value columns dropped.
x_train3 = x_train2[selected_features]
x_test3 = x_test2[selected_features]
# Print the summary of the new model.
olsmod3 = sm.OLS(y_train, x_train3).fit()
print(olsmod3.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 896.9
Date: Fri, 07 Oct 2022 Prob (F-statistic): 0.00
Time: 00:35:33 Log-Likelihood: 82.004
No. Observations: 2417 AIC: -134.0
Df Residuals: 2402 BIC: -47.15
Df Model: 14
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5228 0.049 30.802 0.000 1.426 1.620
main_camera_mp 0.0211 0.001 14.814 0.000 0.018 0.024
selfie_camera_mp 0.0138 0.001 12.838 0.000 0.012 0.016
ram 0.0211 0.005 4.223 0.000 0.011 0.031
weight 0.0017 6e-05 27.698 0.000 0.002 0.002
normalized_new_price 0.4419 0.011 39.458 0.000 0.420 0.464
years_since_release -0.0289 0.003 -8.496 0.000 -0.036 -0.022
brand_name_Karbonn 0.1155 0.055 2.110 0.035 0.008 0.223
brand_name_Samsung -0.0373 0.016 -2.261 0.024 -0.070 -0.005
brand_name_Sony -0.0673 0.030 -2.210 0.027 -0.127 -0.008
brand_name_Xiaomi 0.0806 0.026 3.139 0.002 0.030 0.131
os_Others -0.1258 0.027 -4.603 0.000 -0.179 -0.072
os_iOS -0.0897 0.045 -1.988 0.047 -0.178 -0.001
4g_yes 0.0498 0.015 3.299 0.001 0.020 0.079
5g_yes -0.0671 0.031 -2.189 0.029 -0.127 -0.007
==============================================================================
Omnibus: 245.329 Durbin-Watson: 1.901
Prob(Omnibus): 0.000 Jarque-Bera (JB): 481.790
Skew: -0.656 Prob(JB): 2.40e-105
Kurtosis: 4.749 Cond. No. 2.39e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Check model performance on the training set.
print("Training Performance\n")
olsmod3_train_perf = model_performance_regression(olsmod3, x_train3, y_train)
olsmod3_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.39 |
# Check model performance on the test set.
print("Test Performance\n")
olsmod3_test_perf = model_performance_regression(olsmod3, x_test3, y_test)
olsmod3_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.19 | 0.84 | 0.84 | 4.55 |
The new R-squared is 0.839, which means the model still explains ~84% of the variance.
The adjusted R-squared is unchanged from olsmod2 at 0.838, which shows that the dropped high p-value variables were not contributing to the model.
RMSE and MAE values are comparable for train and test sets, indicating that the model is not overfitting.
# Create a dataframe with actual, fitted and residual values.
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod3.fittedvalues # predicted values
df_pred["Residuals"] = olsmod3.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 3026 | 4.09 | 3.87 | 0.22 |
| 1525 | 4.45 | 4.60 | -0.15 |
| 1128 | 4.32 | 4.29 | 0.03 |
| 3003 | 4.28 | 4.19 | 0.09 |
| 2907 | 4.46 | 4.49 | -0.03 |
# Plot the fitted values vs. residuals.
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot");
There is no pattern in the plot. The assumptions of linearity and independence are satisfied.
# Display histogram to check for normality of residuals.
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of Residuals");
The distribution of residuals appears close to normal, though slightly left-skewed.
# Display Q-Q plot to visualize normality.
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab);
The Q-Q plot indicates that the distribution is close to normal with deviation from normal along the high and low ends.
# Use Shapiro-Wilk test for normality.
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9678095579147339, pvalue=7.639652123587057e-23)
Since the p-value is less than 0.05, the Shapiro-Wilk test rejects strict normality. However, with ~2,400 residuals the test flags even minor deviations; the histogram and Q-Q plot show an approximately normal distribution, so the assumption is reasonably satisfied.
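The sensitivity of Shapiro-Wilk can be seen on a genuinely normal sample: the W statistic sits very close to 1 for normal data, yet with large samples even tiny departures drive the p-value below 0.05, which is why the visual checks above matter. The sample below is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=500)

# W near 1 indicates the ordered sample closely matches normal quantiles.
stat, p = stats.shapiro(normal_sample)
print(round(stat, 3))
```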
# Check for homoscedasticity with Goldfeld–Quandt test.
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
[('F statistic', 1.009794401185382), ('p-value', 0.43316051505003356)]
Since the p-value is greater than 0.05, the residuals are homoscedastic. The assumption is satisfied.
# Make copies of the final training and testing sets to test the final model.
x_train_final = x_train3.copy()
x_test_final = x_test3.copy()
# Print the summary of the final model.
olsmodel_final = sm.OLS(y_train, x_train_final).fit()
print(olsmodel_final.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 896.9
Date: Fri, 07 Oct 2022 Prob (F-statistic): 0.00
Time: 00:39:35 Log-Likelihood: 82.004
No. Observations: 2417 AIC: -134.0
Df Residuals: 2402 BIC: -47.15
Df Model: 14
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.5228 0.049 30.802 0.000 1.426 1.620
main_camera_mp 0.0211 0.001 14.814 0.000 0.018 0.024
selfie_camera_mp 0.0138 0.001 12.838 0.000 0.012 0.016
ram 0.0211 0.005 4.223 0.000 0.011 0.031
weight 0.0017 6e-05 27.698 0.000 0.002 0.002
normalized_new_price 0.4419 0.011 39.458 0.000 0.420 0.464
years_since_release -0.0289 0.003 -8.496 0.000 -0.036 -0.022
brand_name_Karbonn 0.1155 0.055 2.110 0.035 0.008 0.223
brand_name_Samsung -0.0373 0.016 -2.261 0.024 -0.070 -0.005
brand_name_Sony -0.0673 0.030 -2.210 0.027 -0.127 -0.008
brand_name_Xiaomi 0.0806 0.026 3.139 0.002 0.030 0.131
os_Others -0.1258 0.027 -4.603 0.000 -0.179 -0.072
os_iOS -0.0897 0.045 -1.988 0.047 -0.178 -0.001
4g_yes 0.0498 0.015 3.299 0.001 0.020 0.079
5g_yes -0.0671 0.031 -2.189 0.029 -0.127 -0.007
==============================================================================
Omnibus: 245.329 Durbin-Watson: 1.901
Prob(Omnibus): 0.000 Jarque-Bera (JB): 481.790
Skew: -0.656 Prob(JB): 2.40e-105
Kurtosis: 4.749 Cond. No. 2.39e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Check model performance on the training set.
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(
olsmodel_final, x_train_final, y_train
)
olsmodel_final_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.23 | 0.18 | 0.84 | 0.84 | 4.39 |
# Check model performance on the test set.
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(
olsmodel_final, x_test_final, y_test
)
olsmodel_final_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.24 | 0.19 | 0.84 | 0.84 | 4.55 |
The model is able to explain ~84% of the variation in the data.
The train and test RMSE and MAE are low and comparable, so the model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within 4.55% of the normalized used price.
Hence, the model olsmodel_final is good for prediction as well as inference purposes.
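For inference, the coefficients can be translated into approximate percentage effects on the raw price, under the assumption (not confirmed in this excerpt) that the normalized prices are natural logs of the actual prices. A sketch using two coefficients copied from the final summary:

```python
import numpy as np

# Coefficients from olsmodel_final above.
coef_main_camera = 0.0211   # per additional MP of main camera
coef_years = -0.0289        # per additional year since release

# If normalized price = ln(price), a one-unit change in a predictor
# multiplies the raw price by exp(coef), i.e. a (exp(coef) - 1) * 100% change.
pct_per_mp = (np.exp(coef_main_camera) - 1) * 100
pct_per_year = (np.exp(coef_years) - 1) * 100
print(round(pct_per_mp, 2), round(pct_per_year, 2))  # 2.13 -2.85
```

Read this way, each extra megapixel of main camera is associated with roughly a 2.1% higher used price, and each year since release with roughly a 2.8% lower one, holding the other predictors fixed.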
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge, or preferably at a low cost. This is beneficial to hotel guests, but it is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high on last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts.
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Import libraries for data manipulation.
import pandas as pd
import numpy as np
# Import libraries for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Import library to split data.
from sklearn.model_selection import train_test_split
# Import predictive model-building libraries.
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Import GridSearch to tune different models.
from sklearn.model_selection import GridSearchCV
# Import functions to evaluate model performance.
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# Import libraries to ignore irrelevant warnings.
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Set the precision of floating numbers to 5 decimal points.
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the data set.
data = pd.read_csv('/content/drive/My Drive/Data science/Data sets/INNHotelsGroup.csv')
# Use .shape attribute to display the number of rows & columns in the data set.
data.shape
(36275, 19)
The data set has 36275 rows and 19 columns.
# Display a sample of 5 rows from the data set to get a general idea of the information & make sure it is loaded properly.
data.sample(5)
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22995 | INN22996 | 2 | 0 | 1 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 141 | 2018 | 5 | 30 | Online | 0 | 0 | 0 | 99.96 | 0 | Canceled |
| 10065 | INN10066 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 77 | 2018 | 10 | 30 | Online | 0 | 0 | 0 | 85.50 | 2 | Not_Canceled |
| 14837 | INN14838 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 4 | 1 | 2018 | 2 | 17 | Online | 0 | 0 | 0 | 66.30 | 1 | Not_Canceled |
| 8360 | INN08361 | 2 | 0 | 0 | 3 | Meal Plan 1 | 0 | Room_Type 5 | 11 | 2017 | 12 | 29 | Offline | 0 | 0 | 0 | 157.67 | 0 | Not_Canceled |
| 23229 | INN23230 | 2 | 0 | 2 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 76 | 2018 | 4 | 10 | Online | 0 | 0 | 0 | 90.95 | 1 | Not_Canceled |
# Check the number of unique values in the Booking_ID column to see if it should be dropped.
data.Booking_ID.nunique()
36275
# All the Booking ID values are unique, so the column can be dropped as it will not aid in analysis.
data.drop('Booking_ID', axis=1, inplace=True)
# Use .info() method to check the non-null counts and data types of each column.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36275 entries, 0 to 36274 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 36275 non-null int64 1 no_of_children 36275 non-null int64 2 no_of_weekend_nights 36275 non-null int64 3 no_of_week_nights 36275 non-null int64 4 type_of_meal_plan 36275 non-null object 5 required_car_parking_space 36275 non-null int64 6 room_type_reserved 36275 non-null object 7 lead_time 36275 non-null int64 8 arrival_year 36275 non-null int64 9 arrival_month 36275 non-null int64 10 arrival_date 36275 non-null int64 11 market_segment_type 36275 non-null object 12 repeated_guest 36275 non-null int64 13 no_of_previous_cancellations 36275 non-null int64 14 no_of_previous_bookings_not_canceled 36275 non-null int64 15 avg_price_per_room 36275 non-null float64 16 no_of_special_requests 36275 non-null int64 17 booking_status 36275 non-null object dtypes: float64(1), int64(13), object(4) memory usage: 5.0+ MB
None of the columns have null values.
Columns of object type: type_of_meal_plan, room_type_reserved, market_segment_type, booking_status
Columns of numeric type: no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, lead_time, arrival_year, arrival_month, arrival_date, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, avg_price_per_room, no_of_special_requests
# Check the statistical summary for the numeric columns, dropping binary flags and date components.
data.describe().T.drop(['required_car_parking_space','repeated_guest','arrival_year','arrival_month','arrival_date'])
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
The number of adults ranges from 0 to 4 with a median of 2. The number of children ranges from 0 to 10 with a median of 0. The number of weekend nights ranges from 0 to 7 with a median of 1. The number of week nights ranges from 0 to 17 with a median of 2. Lead time ranges from 0 to 443 days with a median of 57 days. The number of previous cancellations ranges from 0 to 13 with a median of 0. The number of previous bookings not canceled ranges from 0 to 58 with a median of 0. The average price per room ranges from 0 to 540 euros with a median of about 100 euros. The number of special requests ranges from 0 to 5 with a median of 0.
# Check the statistical summary for the categorical variables.
data.describe(include='object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 |
| market_segment_type | 36275 | 5 | Online | 23214 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 |
There are 4 types of meal plans (including those who did not select a meal plan), with the most popular being Meal Plan 1. There are 7 types of rooms, with the most popular being Room Type 1. There are 5 market segment types, with the most common being Online. More bookings are not canceled than canceled.
# Check for duplicated values.
data.duplicated().sum()
0
There are no duplicated values in the data set.
# Use a countplot to visualize which months are busiest.
plt.title('Countplot: Arrival Month')
sns.countplot(data=data, x='arrival_month')
plt.xlabel("Arrival Month")
plt.ylabel('Count');
The busiest month is October, followed by September and August.
# Use a countplot to visualize the distribution of market segment types.
plt.title('Countplot: Market Segment Type')
sns.countplot(data=data, x='market_segment_type', order=data['market_segment_type'].value_counts().index)
plt.xlabel("Market Segment Type")
plt.ylabel('Count');
The majority of guests come from the online market segment. The next most common market segment is offline.
# Use a boxplot to visualize variations in distribution of average room price by market segment type.
plt.title('Boxplot: Average Room Price by Market Segment Type')
sns.boxplot(data=data, x='market_segment_type', y='avg_price_per_room', order=data.groupby('market_segment_type')['avg_price_per_room'].median().sort_values().index)
plt.xlabel("Market Segment Type")
plt.ylabel('Average price per room (euros)');
The online market segment has the highest average room price, while the complementary market segment has the lowest.
# Use a countplot to visualize the number of bookings canceled vs. not canceled.
plt.title('Countplot: Booking Status')
sns.countplot(data=data, x='booking_status')
plt.xlabel("Booking status")
plt.ylabel('Count');
# Use logical indexing with .shape attribute to calculate the percent of bookings canceled.
data[data['booking_status']=='Canceled'].shape[0]/data.shape[0]*100
32.76361130254997
About 33% of bookings are canceled.
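The same percentage can be read off directly with `value_counts(normalize=True)`, which avoids the manual division. A minimal sketch on a toy series (in the notebook, the call would be on `data['booking_status']`):

```python
import pandas as pd

# Toy series standing in for data['booking_status'] (illustrative values).
toy_status = pd.Series(["Canceled", "Not_Canceled", "Not_Canceled",
                        "Canceled", "Not_Canceled", "Not_Canceled"])

# normalize=True returns proportions instead of raw counts.
share = toy_status.value_counts(normalize=True)
print(round(share["Canceled"] * 100, 2))
```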
# Use logical indexing and .shape attribute to calculate percentage of bookings canceled by repeating guests.
repeats = data[data['repeated_guest']==1]
repeats[repeats['booking_status']=='Canceled'].shape[0]/repeats.shape[0]*100
1.7204301075268817
Slightly less than 2% of repeating guests cancel their bookings.
# Use countplot with hue parameter to visualize differences in cancellation by number of special requests.
plt.title('Countplot: Cancellation by Number of Special Requests')
sns.countplot(data=data, x='no_of_special_requests', hue='booking_status')
plt.legend(loc='upper right')
plt.xlabel("Number of special requests")
plt.ylabel('Count');
It appears that the likelihood of cancellation decreases as the number of special requests increases.
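The visual impression can be quantified with a row-normalized crosstab, where the `Canceled` column gives the cancellation rate for each request count. A sketch on toy data (illustrative values, not the real booking data):

```python
import pandas as pd

# Toy frame mirroring the two columns plotted above.
toy = pd.DataFrame({
    "no_of_special_requests": [0, 0, 0, 1, 1, 2],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

# normalize='index' converts each row to proportions, so the 'Canceled'
# column is the cancellation rate for that number of special requests.
rates = pd.crosstab(
    toy["no_of_special_requests"], toy["booking_status"], normalize="index"
)
print(rates)
```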
# Use a countplot to visualize the distribution of number of adults.
plt.title('Countplot: Number of Adults')
sns.countplot(data=data, x='no_of_adults')
plt.xlabel("Number of Adults")
plt.ylabel('Count');
The most common number of adults per booking is 2, followed by 1 then 3.
# Use a countplot to visualize the distribution of number of children.
plt.title('Countplot: Number of Children')
sns.countplot(data=data, x='no_of_children')
plt.xlabel("Number of Children")
plt.ylabel('Count');
The vast majority of bookings do not include any children. For those with children, 1 or 2 are most common, with a few outliers of 3, 9 and 10.
# Use a countplot to visualize the distribution of number of weekend nights.
plt.title('Countplot: Number of Weekend Nights')
sns.countplot(data=data, x='no_of_weekend_nights')
plt.xlabel("Number of Weekend Nights")
plt.ylabel('Count');
The most common number of weekend nights is 0, followed by 1 and 2. The data is skewed right with outliers on the upper end.
# Use a countplot to visualize the distribution of number of week nights.
plt.title('Countplot: Number of Week Nights')
sns.countplot(data=data, x='no_of_week_nights')
plt.xlabel("Number of Week Nights")
plt.ylabel('Count');
The most common number of week nights is 2. The data appears somewhat normally distributed around two but with a long right tail of outliers on the upper end (up to a maximum of 17 week nights).
# Use a countplot to visualize the distribution of meal plans.
plt.title('Countplot: Type of Meal Plan')
sns.countplot(data=data, x='type_of_meal_plan', order=['Not Selected','Meal Plan 1','Meal Plan 2','Meal Plan 3'])
plt.xlabel("Type of Meal Plan")
plt.ylabel('Count');
Meal Plan 1 (breakfast) is the most common meal plan by far, followed by those who did not select a meal plan and then Meal Plan 2 (breakfast and one other meal). Meal Plan 3 (breakfast, lunch, and dinner) is chosen by very few guests.
# Use a countplot to visualize the distribution of bookings requiring a car parking space.
plt.title('Countplot: Required Car Parking Space')
sns.countplot(data=data, x='required_car_parking_space')
plt.xlabel("Required Car Parking Space")
plt.ylabel('Count');
The vast majority of guests do not require a car parking space.
# Use a countplot to visualize the distribution of room types reserved.
plt.figure(figsize=(15, 5))
plt.title('Countplot: Room Type Reserved')
sns.countplot(data=data, x='room_type_reserved', order=['Room_Type 1','Room_Type 2','Room_Type 3','Room_Type 4','Room_Type 5','Room_Type 6','Room_Type 7'])
plt.xlabel("Room Type Reserved")
plt.ylabel('Count');
Room Type 1 is the most common selection by far, followed by Room Type 4. Very few guests select the other room types.
# Create a user-defined function to display histogram & boxplot together for numeric variables.
def histogram_boxplot(data, feature, figsize=(18, 5), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (18,5))
kde: whether to show the density curve (default True)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.3, 0.7)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Visualize histogram & boxplot for lead time variable.
histogram_boxplot(data=data, feature='lead_time')
Lead time is heavily right skewed, with a high peak near zero, median around 60 and mean around 85 days.
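The skew visible in the plot can also be confirmed numerically with `.skew()`; for right-skewed data the mean exceeds the median. A toy illustration (in the notebook, the real call would be `data['lead_time'].skew()`):

```python
import pandas as pd

# Toy right-skewed values standing in for lead_time.
toy_lead = pd.Series([0, 1, 2, 5, 10, 20, 60, 85, 150, 443])

# A positive skew coefficient indicates a right-skewed distribution.
print(toy_lead.skew())
print(toy_lead.mean(), toy_lead.median())
```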
# Use a countplot to visualize the distribution of arrival year.
plt.title('Countplot: Arrival Year')
sns.countplot(data=data, x='arrival_year')
plt.xlabel("Arrival Year")
plt.ylabel('Count');
2018 is the most common arrival year in the data set.
# Use a countplot to visualize the distribution of repeated guests.
plt.title('Countplot: Repeated Guests')
sns.countplot(data=data, x='repeated_guest')
plt.xlabel("Repeated Guests")
plt.ylabel('Count');
The vast majority of guests are not repeat guests.
# Visualize histogram & boxplot for number of previous cancellations.
histogram_boxplot(data=data, feature='no_of_previous_cancellations')
The vast majority of guests do not have any previous cancellations. The distribution is heavily right skewed with many outliers on the upper end.
# Visualize histogram & boxplot for number of previous bookings not canceled.
histogram_boxplot(data=data, feature='no_of_previous_bookings_not_canceled')
The vast majority of guests have no previous non-canceled bookings; most of the rest have between 1 and 3. The distribution is extremely right skewed with a significant number of outliers on the upper end.
# Visualize histogram & boxplot for average price per room.
histogram_boxplot(data=data, feature='avg_price_per_room')
The average room price distribution is roughly bell-shaped, but with a notable number of rooms priced at zero and a right skew with a wide range of outliers at the upper end.
# Use a countplot to visualize the distribution of number of special requests.
plt.title('Countplot: Number of Special Requests')
sns.countplot(data=data, x='no_of_special_requests')
plt.xlabel("Number of Special Requests")
plt.ylabel('Count');
The majority of guests do not have any special requests. The next most common is 1 special request, followed by 2 then 3. It is rare for guests to have 4 or 5 special requests.
# Use countplot with hue parameter to visualize booking status by month.
plt.title('Countplot: Booking Status by Arrival Month')
sns.countplot(data=data, x='arrival_month', hue='booking_status')
plt.xlabel("Arrival Month")
plt.ylabel('Count');
It appears that the highest proportion of cancellations happens in June and July, while very few bookings are canceled in January and December.
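Counts can mislead when months differ in booking volume; the per-month cancellation rate makes the comparison direct. A sketch on toy data (in the notebook, this would group `data['booking_status']` by `data['arrival_month']`):

```python
import pandas as pd

# Toy frame with illustrative months and statuses.
toy = pd.DataFrame({
    "arrival_month": [1, 1, 7, 7, 7, 12],
    "booking_status": ["Not_Canceled", "Canceled", "Canceled",
                       "Canceled", "Not_Canceled", "Not_Canceled"],
})

# Mean of the boolean flag within each month = fraction of bookings canceled.
monthly_rate = (
    toy["booking_status"].eq("Canceled")
    .groupby(toy["arrival_month"])
    .mean()
)
print(monthly_rate.sort_values(ascending=False))
```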
# Use boxplot to visualize the changes in average room price across different months.
plt.figure(figsize=(15, 5))
plt.title('Boxplot: Average Price per Room by Arrival Month')
sns.boxplot(data=data, x='arrival_month', y='avg_price_per_room')
plt.xlabel("Arrival Month")
plt.ylabel('Average Price per Room (euros)');
There appears to be a slight increase in average room price during the middle of the year.
# Use a boxplot to visualize the relationship between room price and booking status.
plt.title('Boxplot: Average Price per Room by Booking Status')
sns.boxplot(data=data,x='booking_status',y='avg_price_per_room')
plt.xlabel("Booking Status")
plt.ylabel('Average Price per Room (euros)');
The average room price does not appear to have a significant impact on the likelihood of cancellation.
# Use a boxplot to visualize the relationship between lead time and booking status.
plt.title('Boxplot: Lead Time by Booking Status')
sns.boxplot(data=data,x='booking_status',y='lead_time')
plt.xlabel("Booking Status")
plt.ylabel('Lead Time (days)');
It appears that longer lead times increase the likelihood of cancellation.
# Use countplot with hue parameter to visualize booking status across different market segments.
plt.title('Countplot: Booking Status by Market Segment')
sns.countplot(data=data, x='market_segment_type', hue='booking_status')
plt.xlabel("Market Segment Type")
plt.ylabel('Count');
Online bookings appear proportionately more likely to be canceled than other market segments. Corporate and complementary bookings appear least likely to be canceled.
# Make a new DataFrame to encode booking status to 1 for Canceled bookings and 0 for Not_Canceled.
data1 = data.copy()
data1['booking_status'] = data["booking_status"].apply(lambda x: 1 if x == "Canceled" else 0)
# Use heatmap to visualize correlation between numeric variables.
plt.figure(figsize=(18, 8))
sns.heatmap(data1.corr(), annot=True);
The highest correlation is between repeated_guest and no_of_previous_bookings_not_canceled, which makes sense given that repeat guests have prior bookings by definition. There is also a relatively high correlation between no_of_previous_cancellations and no_of_previous_bookings_not_canceled; again no surprise, as both accumulate only for repeat guests. One interesting correlation is between lead_time and booking_status, which supports the hypothesis that bookings made farther in advance are more likely to be canceled.
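The heatmap can be distilled into a ranking by sorting one column of the correlation matrix. A sketch on synthetic data, where cancellation probability is made to rise with lead time to mimic the pattern above (in the notebook, the real call would be on `data1.corr()['booking_status']`):

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame standing in for data1.
rng = np.random.default_rng(0)
lead_time = rng.integers(0, 400, size=200)
booking_status = (lead_time + rng.normal(0, 120, size=200) > 200).astype(int)
toy = pd.DataFrame({
    "lead_time": lead_time,
    "no_of_special_requests": rng.integers(0, 5, size=200),
    "booking_status": booking_status,
})

# Sorting one column of the correlation matrix ranks features by their
# linear association with the target.
target_corr = (
    toy.corr()["booking_status"].drop("booking_status").sort_values(ascending=False)
)
print(target_corr)
```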
# Check for null values in the data.
data1.isnull().sum()
no_of_adults 0 no_of_children 0 no_of_weekend_nights 0 no_of_week_nights 0 type_of_meal_plan 0 required_car_parking_space 0 room_type_reserved 0 lead_time 0 arrival_year 0 arrival_month 0 arrival_date 0 market_segment_type 0 repeated_guest 0 no_of_previous_cancellations 0 no_of_previous_bookings_not_canceled 0 avg_price_per_room 0 no_of_special_requests 0 booking_status 0 dtype: int64
There are no null values to be treated.
# Use boxplot to visualize outliers.
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Several variables have outliers; however, these are legitimate values rather than data errors, so they will not be treated.
# Separate the x & y variables and split each into training and testing sets.
X = data1.drop(["booking_status"], axis=1)
Y = data1["booking_status"]
X = sm.add_constant(X)
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# Print the features of the training and test sets.
print("Shape of training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of training set : (25392, 28) Shape of test set : (10883, 28) Percentage of classes in training set: 0 0.67064 1 0.32936 Name: booking_status, dtype: float64 Percentage of classes in test set: 0 0.67638 1 0.32362 Name: booking_status, dtype: float64
In the original data, about 33% of bookings were Canceled (class 1), and this distribution is preserved in the training and test sets.
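The split above preserved the class ratio by chance; to guarantee it, `train_test_split` accepts a `stratify` argument. A minimal sketch with a toy imbalanced target (roughly 33% positives, like booking_status here):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data set: 100 positives, 200 negatives.
X_toy = pd.DataFrame({"x": range(300)})
y_toy = pd.Series([1] * 100 + [0] * 200)

# stratify=y_toy guarantees (rather than merely tends to preserve) the class
# ratio in both the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```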
# Create a user-defined function to display a DataFrame of model performance metrics.
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    # flag observations whose predicted probability exceeds the threshold;
    # the boolean array serves directly as the predicted class (True -> 1)
    pred = model.predict(predictors) > threshold
acc = accuracy_score(target, pred) # to compute accuracy
recall = recall_score(target, pred) # to compute recall
precision = precision_score(target, pred) # to compute precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a DataFrame of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# Create a user-defined function to display a confusion matrix for classification models.
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Fit the initial logistic regression model.
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3293
Time: 00:54:34 Log-Likelihood: -10793.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.5923 120.817 -7.653 0.000 -1161.390 -687.795
no_of_adults 0.1135 0.038 3.017 0.003 0.040 0.187
no_of_children 0.1563 0.057 2.732 0.006 0.044 0.268
no_of_weekend_nights 0.1068 0.020 5.398 0.000 0.068 0.146
no_of_week_nights 0.0398 0.012 3.239 0.001 0.016 0.064
required_car_parking_space -1.5939 0.138 -11.561 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.868 0.000 0.015 0.016
arrival_year 0.4570 0.060 7.633 0.000 0.340 0.574
arrival_month -0.0415 0.006 -6.418 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.252 0.801 -0.003 0.004
repeated_guest -2.3469 0.617 -3.805 0.000 -3.556 -1.138
no_of_previous_cancellations 0.2664 0.086 3.108 0.002 0.098 0.434
no_of_previous_bookings_not_canceled -0.1727 0.153 -1.131 0.258 -0.472 0.127
avg_price_per_room 0.0188 0.001 25.404 0.000 0.017 0.020
no_of_special_requests -1.4690 0.030 -48.790 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1768 0.067 2.654 0.008 0.046 0.307
type_of_meal_plan_Meal Plan 3 17.8379 5057.795 0.004 0.997 -9895.257 9930.933
type_of_meal_plan_Not Selected 0.2782 0.053 5.245 0.000 0.174 0.382
room_type_reserved_Room_Type 2 -0.3610 0.131 -2.761 0.006 -0.617 -0.105
room_type_reserved_Room_Type 3 -0.0009 1.310 -0.001 0.999 -2.569 2.567
room_type_reserved_Room_Type 4 -0.2821 0.053 -5.305 0.000 -0.386 -0.178
room_type_reserved_Room_Type 5 -0.7176 0.209 -3.432 0.001 -1.127 -0.308
room_type_reserved_Room_Type 6 -0.9456 0.147 -6.434 0.000 -1.234 -0.658
room_type_reserved_Room_Type 7 -1.3964 0.293 -4.767 0.000 -1.971 -0.822
market_segment_type_Complementary -41.8798 8.42e+05 -4.98e-05 1.000 -1.65e+06 1.65e+06
market_segment_type_Corporate -1.1935 0.266 -4.487 0.000 -1.715 -0.672
market_segment_type_Offline -2.1955 0.255 -8.625 0.000 -2.694 -1.697
market_segment_type_Online -0.3990 0.251 -1.588 0.112 -0.891 0.093
========================================================================================================
# Check the performance of the initial model.
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.80604 | 0.63422 | 0.73975 | 0.68293 |
# Create a series to sort the independent variables by VIF.
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
vif_series.sort_values(ascending=False)
const 39468156.70600 market_segment_type_Online 71.17643 market_segment_type_Offline 64.11392 market_segment_type_Corporate 16.92844 market_segment_type_Complementary 4.50011 avg_price_per_room 2.05042 no_of_children 1.97823 room_type_reserved_Room_Type 6 1.97307 repeated_guest 1.78352 no_of_previous_bookings_not_canceled 1.65199 arrival_year 1.43083 no_of_previous_cancellations 1.39569 lead_time 1.39491 room_type_reserved_Room_Type 4 1.36152 no_of_adults 1.34815 arrival_month 1.27567 type_of_meal_plan_Not Selected 1.27218 type_of_meal_plan_Meal Plan 2 1.27185 no_of_special_requests 1.24728 room_type_reserved_Room_Type 7 1.11512 room_type_reserved_Room_Type 2 1.10144 no_of_week_nights 1.09567 no_of_weekend_nights 1.06948 required_car_parking_space 1.03993 room_type_reserved_Room_Type 5 1.02781 type_of_meal_plan_Meal Plan 3 1.02522 arrival_date 1.00674 room_type_reserved_Room_Type 3 1.00330 dtype: float64
Ignoring the constant, the only variables with VIF > 5 are dummy variables of market_segment_type; multicollinearity among dummy levels of the same categorical variable is expected, so they do not need to be dropped.
# Drop the variable with the highest p-value (market_segment_type_Complementary) and re-check model performance.
X_train1 = X_train.drop(["market_segment_type_Complementary"], axis=1,)
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25365
Method: MLE Df Model: 26
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3284
Time: 00:55:15 Log-Likelihood: -10807.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.3903 120.609 -7.664 0.000 -1160.780 -688.000
no_of_adults 0.1065 0.038 2.839 0.005 0.033 0.180
no_of_children 0.1522 0.057 2.660 0.008 0.040 0.264
no_of_weekend_nights 0.1088 0.020 5.506 0.000 0.070 0.147
no_of_week_nights 0.0420 0.012 3.419 0.001 0.018 0.066
required_car_parking_space -1.5940 0.138 -11.559 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.895 0.000 0.015 0.016
arrival_year 0.4566 0.060 7.640 0.000 0.339 0.574
arrival_month -0.0421 0.006 -6.501 0.000 -0.055 -0.029
arrival_date 0.0004 0.002 0.203 0.839 -0.003 0.004
repeated_guest -2.3217 0.617 -3.761 0.000 -3.531 -1.112
no_of_previous_cancellations 0.2646 0.086 3.088 0.002 0.097 0.433
no_of_previous_bookings_not_canceled -0.1728 0.152 -1.136 0.256 -0.471 0.125
avg_price_per_room 0.0191 0.001 26.080 0.000 0.018 0.021
no_of_special_requests -1.4700 0.030 -48.847 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1683 0.067 2.528 0.011 0.038 0.299
type_of_meal_plan_Meal Plan 3 2.2379 1.788 1.252 0.211 -1.266 5.742
type_of_meal_plan_Not Selected 0.2839 0.053 5.355 0.000 0.180 0.388
room_type_reserved_Room_Type 2 -0.3577 0.131 -2.737 0.006 -0.614 -0.102
room_type_reserved_Room_Type 3 -0.0928 1.251 -0.074 0.941 -2.545 2.359
room_type_reserved_Room_Type 4 -0.2805 0.053 -5.276 0.000 -0.385 -0.176
room_type_reserved_Room_Type 5 -0.7338 0.208 -3.521 0.000 -1.142 -0.325
room_type_reserved_Room_Type 6 -0.9616 0.147 -6.546 0.000 -1.250 -0.674
room_type_reserved_Room_Type 7 -1.4362 0.292 -4.916 0.000 -2.009 -0.864
market_segment_type_Corporate -0.6806 0.250 -2.719 0.007 -1.171 -0.190
market_segment_type_Offline -1.6774 0.238 -7.053 0.000 -2.144 -1.211
market_segment_type_Online 0.1111 0.235 0.473 0.636 -0.349 0.572
========================================================================================================
# Drop the variable with the highest p-value (room_type_reserved_Room_Type 3) and re-check model performance.
X_train2 = X_train1.drop(["room_type_reserved_Room_Type 3"], axis=1,)
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit(disp=False)
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25366
Method: MLE Df Model: 25
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3284
Time: 00:55:23 Log-Likelihood: -10807.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.3678 120.609 -7.664 0.000 -1160.757 -687.979
no_of_adults 0.1065 0.038 2.839 0.005 0.033 0.180
no_of_children 0.1522 0.057 2.660 0.008 0.040 0.264
no_of_weekend_nights 0.1088 0.020 5.506 0.000 0.070 0.148
no_of_week_nights 0.0420 0.012 3.420 0.001 0.018 0.066
required_car_parking_space -1.5940 0.138 -11.559 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.895 0.000 0.015 0.016
arrival_year 0.4566 0.060 7.640 0.000 0.339 0.574
arrival_month -0.0421 0.006 -6.502 0.000 -0.055 -0.029
arrival_date 0.0004 0.002 0.203 0.839 -0.003 0.004
repeated_guest -2.3216 0.617 -3.761 0.000 -3.531 -1.112
no_of_previous_cancellations 0.2646 0.086 3.088 0.002 0.097 0.433
no_of_previous_bookings_not_canceled -0.1728 0.152 -1.136 0.256 -0.471 0.125
avg_price_per_room 0.0191 0.001 26.080 0.000 0.018 0.021
no_of_special_requests -1.4700 0.030 -48.849 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1684 0.067 2.529 0.011 0.038 0.299
type_of_meal_plan_Meal Plan 3 2.2381 1.788 1.252 0.211 -1.266 5.742
type_of_meal_plan_Not Selected 0.2839 0.053 5.356 0.000 0.180 0.388
room_type_reserved_Room_Type 2 -0.3577 0.131 -2.737 0.006 -0.614 -0.102
room_type_reserved_Room_Type 4 -0.2805 0.053 -5.275 0.000 -0.385 -0.176
room_type_reserved_Room_Type 5 -0.7338 0.208 -3.520 0.000 -1.142 -0.325
room_type_reserved_Room_Type 6 -0.9616 0.147 -6.546 0.000 -1.250 -0.674
room_type_reserved_Room_Type 7 -1.4362 0.292 -4.916 0.000 -2.009 -0.864
market_segment_type_Corporate -0.6803 0.250 -2.718 0.007 -1.171 -0.190
market_segment_type_Offline -1.6772 0.238 -7.053 0.000 -2.143 -1.211
market_segment_type_Online 0.1114 0.235 0.474 0.635 -0.349 0.572
========================================================================================================
# Drop the variable with the highest p-value (arrival_date) and re-check model performance.
X_train3 = X_train2.drop(["arrival_date"], axis=1,)
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit(disp=False)
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25367
Method: MLE Df Model: 24
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3284
Time: 00:55:31 Log-Likelihood: -10807.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -924.3411 120.612 -7.664 0.000 -1160.737 -687.945
no_of_adults 0.1067 0.038 2.845 0.004 0.033 0.180
no_of_children 0.1523 0.057 2.663 0.008 0.040 0.264
no_of_weekend_nights 0.1089 0.020 5.514 0.000 0.070 0.148
no_of_week_nights 0.0419 0.012 3.418 0.001 0.018 0.066
required_car_parking_space -1.5941 0.138 -11.560 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.907 0.000 0.015 0.016
arrival_year 0.4566 0.060 7.639 0.000 0.339 0.574
arrival_month -0.0421 0.006 -6.524 0.000 -0.055 -0.029
repeated_guest -2.3226 0.617 -3.762 0.000 -3.533 -1.112
no_of_previous_cancellations 0.2645 0.086 3.086 0.002 0.097 0.433
no_of_previous_bookings_not_canceled -0.1727 0.152 -1.136 0.256 -0.471 0.125
avg_price_per_room 0.0191 0.001 26.081 0.000 0.018 0.021
no_of_special_requests -1.4698 0.030 -48.858 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1687 0.067 2.534 0.011 0.038 0.299
type_of_meal_plan_Meal Plan 3 2.2395 1.787 1.253 0.210 -1.262 5.741
type_of_meal_plan_Not Selected 0.2840 0.053 5.358 0.000 0.180 0.388
room_type_reserved_Room_Type 2 -0.3573 0.131 -2.734 0.006 -0.613 -0.101
room_type_reserved_Room_Type 4 -0.2803 0.053 -5.273 0.000 -0.385 -0.176
room_type_reserved_Room_Type 5 -0.7336 0.208 -3.520 0.000 -1.142 -0.325
room_type_reserved_Room_Type 6 -0.9615 0.147 -6.546 0.000 -1.249 -0.674
room_type_reserved_Room_Type 7 -1.4359 0.292 -4.915 0.000 -2.008 -0.863
market_segment_type_Corporate -0.6798 0.250 -2.716 0.007 -1.170 -0.189
market_segment_type_Offline -1.6777 0.238 -7.055 0.000 -2.144 -1.212
market_segment_type_Online 0.1111 0.235 0.473 0.636 -0.349 0.571
========================================================================================================
# Drop the variable with the highest p-value (market_segment_type_Online) and re-check model performance.
X_train4 = X_train3.drop(["market_segment_type_Online"], axis=1,)
logit4 = sm.Logit(y_train, X_train4.astype(float))
lg4 = logit4.fit(disp=False)
print(lg4.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25368
Method: MLE Df Model: 23
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3284
Time: 00:55:53 Log-Likelihood: -10807.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -921.5770 120.475 -7.650 0.000 -1157.704 -685.450
no_of_adults 0.1086 0.037 2.914 0.004 0.036 0.182
no_of_children 0.1527 0.057 2.670 0.008 0.041 0.265
no_of_weekend_nights 0.1088 0.020 5.511 0.000 0.070 0.148
no_of_week_nights 0.0418 0.012 3.409 0.001 0.018 0.066
required_car_parking_space -1.5952 0.138 -11.570 0.000 -1.865 -1.325
lead_time 0.0157 0.000 59.180 0.000 0.015 0.016
arrival_year 0.4553 0.060 7.625 0.000 0.338 0.572
arrival_month -0.0423 0.006 -6.562 0.000 -0.055 -0.030
repeated_guest -2.3315 0.617 -3.781 0.000 -3.540 -1.123
no_of_previous_cancellations 0.2654 0.086 3.097 0.002 0.097 0.433
no_of_previous_bookings_not_canceled -0.1726 0.152 -1.135 0.257 -0.471 0.126
avg_price_per_room 0.0191 0.001 26.344 0.000 0.018 0.021
no_of_special_requests -1.4694 0.030 -48.871 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1674 0.067 2.516 0.012 0.037 0.298
type_of_meal_plan_Meal Plan 3 2.1843 1.753 1.246 0.213 -1.251 5.620
type_of_meal_plan_Not Selected 0.2858 0.053 5.405 0.000 0.182 0.389
room_type_reserved_Room_Type 2 -0.3561 0.131 -2.726 0.006 -0.612 -0.100
room_type_reserved_Room_Type 4 -0.2822 0.053 -5.322 0.000 -0.386 -0.178
room_type_reserved_Room_Type 5 -0.7348 0.208 -3.527 0.000 -1.143 -0.326
room_type_reserved_Room_Type 6 -0.9644 0.147 -6.571 0.000 -1.252 -0.677
room_type_reserved_Room_Type 7 -1.4409 0.292 -4.936 0.000 -2.013 -0.869
market_segment_type_Corporate -0.7877 0.103 -7.664 0.000 -0.989 -0.586
market_segment_type_Offline -1.7874 0.052 -34.398 0.000 -1.889 -1.686
========================================================================================================
# Drop the variable with the highest p-value (no_of_previous_bookings_not_canceled) and re-check model performance.
X_train5 = X_train4.drop(["no_of_previous_bookings_not_canceled"], axis=1,)
logit5 = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit5.fit(disp=False)
print(lg5.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25369
Method: MLE Df Model: 22
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3283
Time: 00:56:01 Log-Likelihood: -10808.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -919.9661 120.487 -7.635 0.000 -1156.117 -683.816
no_of_adults 0.1086 0.037 2.913 0.004 0.036 0.182
no_of_children 0.1527 0.057 2.670 0.008 0.041 0.265
no_of_weekend_nights 0.1088 0.020 5.508 0.000 0.070 0.147
no_of_week_nights 0.0418 0.012 3.406 0.001 0.018 0.066
required_car_parking_space -1.5946 0.138 -11.564 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.223 0.000 0.015 0.016
arrival_year 0.4545 0.060 7.611 0.000 0.337 0.571
arrival_month -0.0423 0.006 -6.561 0.000 -0.055 -0.030
repeated_guest -2.7372 0.557 -4.916 0.000 -3.828 -1.646
no_of_previous_cancellations 0.2289 0.077 2.984 0.003 0.079 0.379
avg_price_per_room 0.0192 0.001 26.354 0.000 0.018 0.021
no_of_special_requests -1.4699 0.030 -48.890 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1664 0.067 2.502 0.012 0.036 0.297
type_of_meal_plan_Meal Plan 3 2.1848 1.754 1.245 0.213 -1.254 5.623
type_of_meal_plan_Not Selected 0.2858 0.053 5.405 0.000 0.182 0.389
room_type_reserved_Room_Type 2 -0.3564 0.131 -2.728 0.006 -0.613 -0.100
room_type_reserved_Room_Type 4 -0.2824 0.053 -5.326 0.000 -0.386 -0.178
room_type_reserved_Room_Type 5 -0.7349 0.208 -3.528 0.000 -1.143 -0.327
room_type_reserved_Room_Type 6 -0.9649 0.147 -6.574 0.000 -1.253 -0.677
room_type_reserved_Room_Type 7 -1.4416 0.292 -4.938 0.000 -2.014 -0.869
market_segment_type_Corporate -0.7925 0.103 -7.709 0.000 -0.994 -0.591
market_segment_type_Offline -1.7875 0.052 -34.400 0.000 -1.889 -1.686
==================================================================================================
# Drop the variable with the highest p-value (type_of_meal_plan_Meal Plan 3) and re-check model performance.
X_train6 = X_train5.drop(["type_of_meal_plan_Meal Plan 3"], axis=1,)
logit6 = sm.Logit(y_train, X_train6.astype(float))
lg6 = logit6.fit(disp=False)
print(lg6.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 04 Nov 2022 Pseudo R-squ.: 0.3283
Time: 00:56:07 Log-Likelihood: -10809.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -917.2860 120.456 -7.615 0.000 -1153.376 -681.196
no_of_adults 0.1086 0.037 2.914 0.004 0.036 0.182
no_of_children 0.1522 0.057 2.660 0.008 0.040 0.264
no_of_weekend_nights 0.1086 0.020 5.501 0.000 0.070 0.147
no_of_week_nights 0.0418 0.012 3.403 0.001 0.018 0.066
required_car_parking_space -1.5943 0.138 -11.561 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.218 0.000 0.015 0.016
arrival_year 0.4531 0.060 7.591 0.000 0.336 0.570
arrival_month -0.0424 0.006 -6.568 0.000 -0.055 -0.030
repeated_guest -2.7365 0.557 -4.915 0.000 -3.828 -1.645
no_of_previous_cancellations 0.2289 0.077 2.983 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.343 0.000 0.018 0.021
no_of_special_requests -1.4699 0.030 -48.892 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1654 0.067 2.487 0.013 0.035 0.296
type_of_meal_plan_Not Selected 0.2858 0.053 5.405 0.000 0.182 0.389
room_type_reserved_Room_Type 2 -0.3560 0.131 -2.725 0.006 -0.612 -0.100
room_type_reserved_Room_Type 4 -0.2826 0.053 -5.330 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7352 0.208 -3.529 0.000 -1.143 -0.327
room_type_reserved_Room_Type 6 -0.9650 0.147 -6.572 0.000 -1.253 -0.677
room_type_reserved_Room_Type 7 -1.4312 0.293 -4.892 0.000 -2.005 -0.858
market_segment_type_Corporate -0.7928 0.103 -7.711 0.000 -0.994 -0.591
market_segment_type_Offline -1.7867 0.052 -34.391 0.000 -1.889 -1.685
==================================================================================================
Now no feature has p-value greater than 0.05, so the features in X_train6 will be the final ones and lg6 will be the final model.
# Check the performance of the model after dropping insignificant variables.
print("Training performance:")
model_performance_classification_statsmodels(lg6, X_train6, y_train)
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80541 | 0.63255 | 0.73903 | 0.68166 |
The values for the performance measures are comparable to those of the original model.
# Convert coefficients to odds.
odds = np.exp(lg6.params)
# Find the percent change.
perc_change_odds = (odds - 1) * 100
# Create a DataFrame with odds.
pd.DataFrame({"Odds": odds, "% change": perc_change_odds}, index=X_train6.columns).sort_values('% change',ascending=False)
| | Odds | % change |
|---|---|---|
| arrival_year | 1.57324 | 57.32351 |
| type_of_meal_plan_Not Selected | 1.33089 | 33.08924 |
| no_of_previous_cancellations | 1.25716 | 25.71567 |
| type_of_meal_plan_Meal Plan 2 | 1.17992 | 17.99156 |
| no_of_children | 1.16436 | 16.43601 |
| no_of_adults | 1.11475 | 11.47536 |
| no_of_weekend_nights | 1.11475 | 11.47526 |
| no_of_week_nights | 1.04264 | 4.26363 |
| avg_price_per_room | 1.01935 | 1.93479 |
| lead_time | 1.01584 | 1.58352 |
| arrival_month | 0.95853 | -4.14725 |
| room_type_reserved_Room_Type 4 | 0.75383 | -24.61701 |
| room_type_reserved_Room_Type 2 | 0.70046 | -29.95389 |
| room_type_reserved_Room_Type 5 | 0.47940 | -52.05967 |
| market_segment_type_Corporate | 0.45258 | -54.74162 |
| room_type_reserved_Room_Type 6 | 0.38099 | -61.90093 |
| room_type_reserved_Room_Type 7 | 0.23903 | -76.09669 |
| no_of_special_requests | 0.22994 | -77.00595 |
| required_car_parking_space | 0.20305 | -79.69523 |
| market_segment_type_Offline | 0.16750 | -83.24963 |
| repeated_guest | 0.06480 | -93.52026 |
| const | 0.00000 | -100.00000 |
Coefficient interpretations: holding the other variables constant, each additional special request multiplies the odds of cancellation by about 0.23 (a 77% decrease), a repeated guest has roughly 94% lower odds of canceling, each additional day of lead time raises the odds by about 1.6%, and bookings with no meal plan selected have about 33% higher odds of cancellation.
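Any row of the odds table can be reproduced directly from the corresponding coefficient in the lg6 summary; for example, for no_of_special_requests:

```python
import numpy as np

beta = -1.4699  # no_of_special_requests coefficient from the lg6 summary
odds = np.exp(beta)
perc_change = (odds - 1) * 100
print(round(odds, 5), round(perc_change, 2))  # matches the table row up to rounding
```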
# Create a confusion matrix to check model performance on the training set.
confusion_matrix_statsmodels(lg6, X_train6, y_train)
The rate of true negatives is about 60%
The rate of false positives is about 7%
The rate of false negatives is about 12%
The rate of true positives is about 21%
# Display the model performance metrics DataFrame.
log_reg_model_train_perf = model_performance_classification_statsmodels(lg6, X_train6, y_train)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80541 | 0.63255 | 0.73903 | 0.68166 |
# Plot the ROC-AUC on the training data.
logit_roc_auc_train = roc_auc_score(y_train, lg6.predict(X_train6))
fpr, tpr, thresholds = roc_curve(y_train, lg6.predict(X_train6))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The ROC curve indicates that the logistic regression model performs well on the training set.
# Find the optimal threshold from the AUC-ROC curve, where TPR is high and FPR is low.
fpr, tpr, thresholds = roc_curve(y_train, lg6.predict(X_train6))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.37104666234889655
# Display the confusion matrix using the optimal threshold.
confusion_matrix_statsmodels(lg6, X_train6, y_train, threshold=optimal_threshold_auc_roc)
The rate of true negatives is about 55% (worse than the default threshold)
The rate of false positives is about 12% (worse than default)
The rate of false negatives is about 9% (better than default)
The rate of true positives is about 24% (better than default)
# Check model performance using the optimal threshold.
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg6, X_train6, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79289 | 0.73562 | 0.66870 | 0.70056 |
Accuracy is about the same, recall has improved by about 10 percentage points, precision has decreased by about 7 percentage points, and the F1 score has improved by about 2 percentage points.
# Plot a precision-recall curve to check for a better threshold.
y_scores = lg6.predict(X_train6)
prec, rec, tre = precision_recall_curve(y_train, y_scores)

def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()
# Set the new threshold where precision and recall meet.
optimal_threshold_curve = 0.42
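The crossover was read off the plot here; it can also be located programmatically as the threshold that minimizes the precision-recall gap. A minimal sketch, using synthetic labels and scores in place of y_train and the model predictions so it runs on its own:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins for y_train and lg6.predict(X_train6)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.4 * y_true + 0.6 * rng.random(1000), 0, 1)

prec, rec, tre = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds, so trim the last element
crossover = int(np.argmin(np.abs(prec[:-1] - rec[:-1])))
threshold_at_crossover = tre[crossover]
print(threshold_at_crossover)
```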
# Check the confusion matrix with the new threshold.
confusion_matrix_statsmodels(lg6, X_train6, y_train, threshold=optimal_threshold_curve)
The rate of true negatives is about 57% (in between the two other models)
The rate of false positives is about 10% (in between the others)
The rate of false negatives is about 10% (in between the others)
The rate of true positives is about 23% (in between the others)
# Check the performance measures DataFrame with the new threshold.
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg6, X_train6, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80128 | 0.69939 | 0.69789 | 0.69864 |
Accuracy and F1 score have stayed about the same, while recall has decreased by about 3 percentage points and precision has increased by about the same amount, leaving both at roughly 70%. Of the three thresholds, this one gives the best balance across the performance measures on the training data.
# Keep only the columns retained in the final training set.
X_test6 = X_test[X_train6.columns].astype(float)
# Display the confusion matrix for the default threshold on the test data.
confusion_matrix_statsmodels(lg6, X_test6, y_test)
The rate of true negatives is about 60%
The rate of false positives is about 8%
The rate of false negatives is about 12%
The rate of true positives is about 20%
# Display performance measures DataFrame.
log_reg_model_test_perf = model_performance_classification_statsmodels(lg6, X_test6, y_test)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80465 | 0.63089 | 0.72900 | 0.67641 |
# Plot ROC curve.
logit_roc_auc_train = roc_auc_score(y_test, lg6.predict(X_test6))
fpr, tpr, thresholds = roc_curve(y_test, lg6.predict(X_test6))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The ROC curve shows that the model is performing well on the test data.
# Display the confusion matrix with the ROC-optimal threshold on the test data.
confusion_matrix_statsmodels(lg6, X_test6, y_test, threshold=optimal_threshold_auc_roc)
The rate of true negatives is about 55% (worse than default threshold)
The rate of false positives is about 12% (worse than default)
The rate of false negatives is about 8% (better than default)
The rate of true positives is about 23% (better than default)
# Display performance measures DataFrame.
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg6, X_test6, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79601 | 0.73935 | 0.66667 | 0.70113 |
Just like on the training data, this threshold yields higher recall, lower precision, and a slightly improved F1 score.
# Display the confusion matrix with the precision-recall threshold on the test data.
confusion_matrix_statsmodels(lg6, X_test6, y_test, threshold=optimal_threshold_curve)
The rate of true negatives is about 58% (in between the two other models)
The rate of false positives is about 10% (in between)
The rate of false negatives is about 10% (in between)
The rate of true positives is about 23% (in between)
# Display performance measures DataFrame.
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg6, X_test6, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80364 | 0.70386 | 0.69381 | 0.69880 |
Again, this threshold strikes a balance between recall and precision, now on the test data.
# Create a DataFrame to summarize the performance measures of each model on the training data.
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression statsmodel",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression statsmodel | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80541 | 0.79289 | 0.80128 |
| Recall | 0.63255 | 0.73562 | 0.69939 |
| Precision | 0.73903 | 0.66870 | 0.69789 |
| F1 | 0.68166 | 0.70056 | 0.69864 |
# Create a DataFrame to summarize the performance measures of each model on the test data.
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression statsmodel",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression statsmodel | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80465 | 0.79601 | 0.80364 |
| Recall | 0.63089 | 0.73935 | 0.70386 |
| Precision | 0.72900 | 0.66667 | 0.69381 |
| F1 | 0.67641 | 0.70113 | 0.69880 |
# Prepare the data for decision tree modeling & split into training & test sets.
X = data1.drop(["booking_status"], axis=1)
Y = data1["booking_status"]
X = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# Build the decision tree model.
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Create a user-defined function to compute & display performance metrics.
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute accuracy
    recall = recall_score(target, pred)  # to compute recall
    precision = precision_score(target, pred)  # to compute precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a DataFrame of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# Create a user-defined function to display a confusion matrix for the decision tree model.
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Display the confusion matrix for the default model on the training data.
confusion_matrix_sklearn(model0, X_train, y_train)
# Compute the performance measures on the training data.
decision_tree_perf_train = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99421 | 0.98661 | 0.99578 | 0.99117 |
As expected, the non-pruned tree overfits the training data, with all performance measures around 100%.
# Display the confusion matrix on the testing data.
confusion_matrix_sklearn(model0, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.87108 | 0.81034 | 0.79521 | 0.80270 |
As expected, performance on the test set is significantly lower compared to the training set due to overfitting.
# Check feature importance for the non-pruned model.
feature_names = list(X_train.columns)
importances = model0.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Lead time has the highest importance in this model, followed by average price per room and online market segment type.
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Use F1 score to compare parameter combinations
scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
min_samples_split=10, random_state=1)
# Check the confusion matrix for the pre-pruned tree on the training data.
confusion_matrix_sklearn(estimator, X_train, y_train)
# Check the performance measures for the pre-pruned tree on the training data.
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83101 | 0.78620 | 0.72428 | 0.75397 |
As expected, the pre-pruned model performs worse on the training data compared to the non-pruned tree.
# Check the confusion matrix for the pre-pruned model on the test data.
confusion_matrix_sklearn(estimator, X_test, y_test)
# Check the performance measures for the pre-pruned tree on the test data.
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83497 | 0.78336 | 0.72758 | 0.75444 |
On the test data the pre-pruned model's performance measures are slightly lower than the default model's, but they are much closer to its own training performance. The pre-pruned tree therefore generalizes better and is the better model overall.
# Visualize the decision tree.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing.
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Display the text report for the pre-pruned decision tree.
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
(Condensed text report of the pre-pruned tree: the root split is on lead_time <= 151.50, with lower splits on no_of_special_requests, market_segment_type_Online, avg_price_per_room, lead_time again, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, no_of_adults, arrival_month, arrival_date, type_of_meal_plan_Not Selected, and market_segment_type_Offline. Two patterns stand out: bookings with lead time above 151.5 days and an average room price above 100.04 are classified as canceled almost without exception, while bookings with more than two special requests are consistently classified as not canceled.)
# Check the feature importances for the pre-pruned model.
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In this model, number of special requests has risen in relative importance while average price per room has fallen in importance.
# Compute the cost-complexity pruning path and display the alpha levels with their corresponding total leaf impurities.
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
# abs() guards against tiny negative alphas produced by floating-point error.
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | 0.00838 |
| 1 | -0.00000 | 0.00838 |
| 2 | 0.00000 | 0.00838 |
| 3 | 0.00000 | 0.00838 |
| 4 | 0.00000 | 0.00838 |
| ... | ... | ... |
| 1837 | 0.00890 | 0.32806 |
| 1838 | 0.00980 | 0.33786 |
| 1839 | 0.01272 | 0.35058 |
| 1840 | 0.03412 | 0.41882 |
| 1841 | 0.08118 | 0.50000 |
1842 rows × 2 columns
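As a sanity check on the path output above, a minimal self-contained sketch on synthetic data (assumption: `make_classification` stands in for the booking data) confirming the two properties the next plot relies on:

```python
# Hedged illustration on synthetic data (not the booking dataset): the pruning
# path returns effective alphas in increasing order, and the total impurity of
# the leaves grows monotonically as larger alphas prune more of the tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=1)
toy_path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_toy, y_toy)
assert np.all(np.diff(toy_path.ccp_alphas) >= 0)  # alphas come back sorted
assert np.all(np.diff(toy_path.impurities) >= 0)  # impurity rises with pruning
print(f"{len(toy_path.ccp_alphas)} effective alphas")
```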
# Plot impurity vs. alpha for the training set.
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
# Train a decision tree using each of the effective alphas.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943
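The largest effective alpha always prunes the tree all the way back to its root, which is why that trivial one-node model is dropped before plotting. A toy sketch of the same behavior (synthetic data, not the booking dataset):

```python
# Hedged toy sketch: fitting with the largest effective alpha from the pruning
# path leaves only the root node, i.e. a constant classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=1)
toy_alphas = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_toy, y_toy
).ccp_alphas
stump = DecisionTreeClassifier(random_state=1, ccp_alpha=toy_alphas[-1]).fit(X_toy, y_toy)
print(stump.tree_.node_count)  # root-only tree
```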
# Remove the last element with only one node and plot # of nodes vs. alpha as well as depth vs. alpha.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
# Compute the F1 score on the training and test data for each of the post-pruned models.
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
# Plot the F1 score vs. alpha on the training and test data.
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Select the model whose alpha maximizes the F1 score on the test data.
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0001226763315516701, class_weight='balanced',
random_state=1)
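Choosing alpha by the argmax of the test-set F1, as above, lets the test data influence model selection. A more conservative alternative (a sketch on synthetic data, not this notebook's pipeline) is to tune `ccp_alpha` by cross-validation on the training split alone:

```python
# Hedged sketch: tune ccp_alpha with 5-fold cross-validation on the training
# data only, so the test set stays untouched until the final evaluation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

alphas = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_tr, y_tr
).ccp_alphas
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"ccp_alpha": np.unique(alphas.clip(min=0))},
    scoring="f1",
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

The cross-validated alpha can then be evaluated once on the held-out test set.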
# Display the confusion matrix for the best post-pruned model on the training data.
confusion_matrix_sklearn(best_model, X_train, y_train)
# Display the model performance measures for the best post-pruned model on the training data.
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.90005 | 0.90350 | 0.81361 | 0.85620 |
# Display the confusion matrix for the best post-pruned model on the test data.
confusion_matrix_sklearn(best_model, X_test, y_test)
# Display the model performance measures for the best post-pruned model on the test data.
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86869 | 0.85576 | 0.76595 | 0.80837 |
The post-pruned model still overfits the training data somewhat, but far less than the unpruned model, and it outperforms the pre-pruned model on all performance measures.
# Display the post-pruned decision tree.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Print the text report for the post-pruned decision tree.
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   ...
|   |   |--- market_segment_type_Online > 0.50
|   |   |   ...
|   |--- no_of_special_requests > 0.50
|   |   ...
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   ...
(full text report truncated for brevity; like the pre-pruned tree, the post-pruned tree splits first on lead_time, with no_of_special_requests, market_segment_type_Online, avg_price_per_room, no_of_adults, and the arrival fields dominating the deeper splits)
| | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | |--- weights: [0.00, 19.74] class: 1 | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | |--- weights: [5.22, 0.00] class: 0 | | | | | | |--- lead_time > 324.50 | | | | | | | |--- avg_price_per_room <= 89.00 | | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 89.00 | | | | | | | | |--- weights: [0.75, 13.66] class: 1 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [5.22, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- lead_time <= 159.50 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- weights: [5.96, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- weights: [1.49, 7.59] class: 1 | | | | | |--- lead_time > 159.50 | | | | | | |--- arrival_date <= 1.50 | | | | | | | |--- weights: [1.49, 3.04] class: 1 | | | | | | |--- arrival_date > 1.50 | | | | | | | |--- weights: [35.79, 1.52] class: 0 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- weights: [12.67, 3.04] class: 0 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [0.00, 3.04] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- weights: [7.46, 206.46] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [8.95, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | |--- weights: [46.97, 4.55] class: 0 | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | |--- no_of_week_nights <= 6.50 | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | | |--- 
lead_time <= 152.50 | | | | | | | | | | | |--- weights: [1.49, 4.55] class: 1 | | | | | | | | | | |--- lead_time > 152.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | |--- weights: [23.11, 19.74] class: 0 | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [2.24, 15.18] class: 1 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | | |--- weights: [0.00, 4.55] class: 1 | | | | | | | |--- no_of_week_nights > 6.50 | | | | | | | | |--- weights: [4.47, 13.66] class: 1 | | | | | |--- arrival_month > 11.50 | | | | | | |--- arrival_date <= 14.50 | | | | | | | |--- weights: [8.20, 3.04] class: 0 | | | | | | |--- arrival_date > 14.50 | | | | | | | |--- weights: [11.18, 31.88] class: 1 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- lead_time <= 348.50 | | | | | | |--- weights: [106.61, 3.04] class: 0 | | | | | |--- lead_time > 348.50 | | | | | | |--- weights: [5.96, 4.55] class: 0 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3200.19] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [23.11, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [35.04, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [3.73, 0.00] class: 0 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [3.73, 22.77] class: 1
# Plot the feature importances for the post-pruned tree.
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The most important features are similar to those of the first two decision tree models: lead time is the most important, followed by the online market segment indicator, average price per room, and number of special requests.
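The same ranking can be read off programmatically by wrapping `feature_importances_` in a sorted pandas Series. A minimal sketch on a toy tree (the frame, labels and column names below are illustrative stand-ins for `best_model` and `feature_names`):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the fitted post-pruned tree and its feature matrix.
X = pd.DataFrame({"lead_time": [1, 5, 9, 2, 8, 3],
                  "avg_price": [10, 10, 20, 20, 30, 30]})
y = [0, 1, 1, 0, 1, 0]
model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Importances always sum to 1; sorting descending reproduces the bar chart order.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

On the real model this is simply `pd.Series(best_model.feature_importances_, index=feature_names).sort_values(ascending=False)`.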
# Summarize the performance measures for all 3 decision trees on the training data.
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree without pruning",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree without pruning | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99421 | 0.83101 | 0.90005 |
| Recall | 0.98661 | 0.78620 | 0.90350 |
| Precision | 0.99578 | 0.72428 | 0.81361 |
| F1 | 0.99117 | 0.75397 | 0.85620 |
# Summarize the performance measures for all 3 decision trees on the test data.
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree without pruning",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree without pruning | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.87108 | 0.83497 | 0.86869 |
| Recall | 0.81034 | 0.78336 | 0.85576 |
| Precision | 0.79521 | 0.72758 | 0.76595 |
| F1 | 0.80270 | 0.75444 | 0.80837 |
On the test data, the post-pruned tree slightly exceeds the F1 score of the default (unpruned) tree while overfitting far less. The pre-pruned tree has the worst test performance across all measures.
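One way to make the overfitting comparison explicit is to subtract each model's test scores from its training scores. A small sketch using the unpruned tree's figures from the two tables above (the helper name is illustrative):

```python
import pandas as pd

def generalization_gap(train_scores: pd.Series, test_scores: pd.Series) -> pd.Series:
    """Positive values mean the model scores higher on train than on test."""
    return (train_scores - test_scores).round(5)

# Unpruned tree, taken from the comparison tables above.
train = pd.Series({"Accuracy": 0.99421, "Recall": 0.98661,
                   "Precision": 0.99578, "F1": 0.99117})
test = pd.Series({"Accuracy": 0.87108, "Recall": 0.81034,
                  "Precision": 0.79521, "F1": 0.80270})
print(generalization_gap(train, test))  # large gaps confirm overfitting
```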
Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally as well as abroad.
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).
OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.
In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.
The increasing number of applicants every year calls for a machine learning based solution that can help shortlist the candidates with a higher chance of visa approval. OFLC has hired your firm EasyVisa for data-driven solutions. As the data scientist, you must analyze the data provided and, with the help of a classification model, facilitate the visa approval process.
The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.
# Import libraries for data manipulation.
import pandas as pd
import numpy as np
# Import libraries for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Import library to split data.
from sklearn.model_selection import train_test_split
# Import predictive model-building libraries.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
# Import functions to evaluate model performance.
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
# Import GridSearch to tune different models.
from sklearn.model_selection import GridSearchCV
# Import ensemble methods.
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
# Import libraries to ignore irrelevant warnings.
import warnings
warnings.filterwarnings('ignore')
# Set the precision of floating numbers to 3 decimal points.
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the data set.
data = pd.read_csv('/content/drive/My Drive/Data science/Data sets/EasyVisa.csv')
# Use .shape attribute to display the number of rows & columns in the data set.
data.shape
(25480, 12)
The data set has 25,480 rows and 12 columns.
# Display a sample of 5 rows from the data set to get a general idea of the information & make sure it is loaded properly.
data.sample(5)
| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19030 | EZYV19031 | Asia | Bachelor's | N | N | 1001 | 1998 | Northeast | 53177.380 | Year | Y | Certified |
| 9736 | EZYV9737 | Asia | Bachelor's | N | N | 1627 | 2011 | Northeast | 50743.320 | Year | Y | Denied |
| 5987 | EZYV5988 | Asia | Master's | Y | N | 2690 | 2007 | West | 85269.340 | Year | Y | Certified |
| 22181 | EZYV22182 | Asia | Master's | N | N | 3120 | 1908 | Northeast | 145360.140 | Year | N | Certified |
| 20478 | EZYV20479 | Asia | Bachelor's | N | N | 1695 | 2005 | South | 56755.660 | Year | Y | Certified |
# Check the number of unique values in the case_id column to see if it should be dropped.
data.case_id.nunique()
25480
# All the case_id values are unique, so the column can be dropped as it will not aid in analysis.
data1 = data.drop('case_id', axis=1)
# Use .info() method to check the non-null counts and data types of each column.
data1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   continent              25480 non-null  object
 1   education_of_employee  25480 non-null  object
 2   has_job_experience     25480 non-null  object
 3   requires_job_training  25480 non-null  object
 4   no_of_employees        25480 non-null  int64
 5   yr_of_estab            25480 non-null  int64
 6   region_of_employment   25480 non-null  object
 7   prevailing_wage        25480 non-null  float64
 8   unit_of_wage           25480 non-null  object
 9   full_time_position     25480 non-null  object
 10  case_status            25480 non-null  object
dtypes: float64(1), int64(2), object(8)
memory usage: 2.1+ MB
None of the columns have null values.
Columns of object type: continent, education_of_employee, has_job_experience, requires_job_training, region_of_employment, unit_of_wage, full_time_position, case_status
Columns of numeric type: no_of_employees, yr_of_estab, prevailing_wage
# Make a copy of the data to convert year established into years since established for improved statistical analysis.
data2 = data1.copy()
data2['yrs_since_estab'] = 2022-data2['yr_of_estab']
data2 = data2.drop(['yr_of_estab'], axis=1)
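Anchoring the age feature to the literal year 2022 ties the notebook to when it was written. A hedged alternative is to anchor to the latest establishment year present in the column; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"yr_of_estab": [1998, 2011, 2016]})

# Use the newest establishment year in the data as the reference point,
# so re-running the notebook later does not silently shift the feature.
reference_year = df["yr_of_estab"].max()
df["yrs_since_estab"] = reference_year - df["yr_of_estab"]
print(df["yrs_since_estab"].tolist())  # → [18, 5, 0]
```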
# Check the statistical summary for the numeric columns
data2.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_employees | 25480.000 | 5667.043 | 22877.929 | -26.000 | 1022.000 | 2109.000 | 3504.000 | 602069.000 |
| prevailing_wage | 25480.000 | 74455.815 | 52815.942 | 2.137 | 34015.480 | 70308.210 | 107735.513 | 319210.270 |
| yrs_since_estab | 25480.000 | 42.590 | 42.367 | 6.000 | 17.000 | 25.000 | 46.000 | 222.000 |
The number of employees ranges from -26 to 602,069; the negative values are invalid and will need to be treated. The mean number of employees (5,667) is well above the median (2,109), so the distribution is right-skewed. The prevailing wage cannot be interpreted directly from this summary because the wage units differ across rows. The years since established range from 6 to 222, with a median of 25 years and a mean of 42.6 years; this distribution is also right-skewed.
# Check the number of rows where the number of employees is negative.
data2[data2['no_of_employees']<0].shape[0]
33
There are relatively few rows that need to be treated for negative values. It can be assumed that these values were erroneously entered as negatives and should be converted to their absolute values.
# Replace all values in the column no_of_employees with their absolute values.
data2["no_of_employees"] = np.abs(data2["no_of_employees"])
# Double check the number of rows where the number of employees is negative.
data2[data2['no_of_employees']<0].shape[0]
0
The negative values have been treated successfully.
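The absolute-value assumption can be contrasted with the more conservative option of dropping the 33 affected rows; a quick sketch on toy values:

```python
import pandas as pd

s = pd.Series([-26, 1022, -11, 2109])  # toy stand-in for no_of_employees

as_abs = s.abs()      # treatment used above: keeps all rows
dropped = s[s >= 0]   # alternative: discards the rows (loses 2 of 4 here)
print(as_abs.tolist(), len(dropped))  # → [26, 1022, 11, 2109] 2
```

With only 33 of 25,480 rows affected, either choice has a negligible effect on the analysis.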
# Check for duplicated values.
data2.duplicated().sum()
0
There are no duplicated values in the data set.
# Use a countplot to visualize the distribution of continents.
plt.figure(figsize=(15, 5))
plt.title('Countplot: Continents')
sns.countplot(data=data2, x='continent', order=data2['continent'].value_counts().index)
plt.xlabel("Continent")
plt.ylabel('Count');
A significant portion of visa applicants are from the continent of Asia. The next most common continents are Europe and North America. The least common continents are South America, Africa and Oceania.
# Use a countplot to visualize the distribution of education level.
plt.title('Countplot: Education Level')
sns.countplot(data=data2, x='education_of_employee', order=['High School',"Bachelor's","Master's","Doctorate"])
plt.xlabel("Education Level")
plt.ylabel('Count');
The most common education level for visa applicants is a Bachelor's degree, with slightly fewer applicants having a Master's degree. Fewer applicants have only graduated high school, and the least common level of education is a doctorate.
# Use a countplot to visualize the distribution of work experience.
plt.title('Countplot: Work Experience')
sns.countplot(data=data2, x='has_job_experience')
plt.xlabel("Work Experience")
plt.ylabel('Count');
It is somewhat more common for applicants to have work experience versus not having work experience.
# Use a countplot to visualize the distribution of those who require job training.
plt.title('Countplot: Requiring Job Training')
sns.countplot(data=data2, x='requires_job_training')
plt.xlabel("Requires Job Training")
plt.ylabel('Count');
A large majority of visa applicants do not require job training.
# Create a user-defined function to display histogram & boxplot together for numeric variables.
def histogram_boxplot(data, feature, figsize=(18, 5), kde=True, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (18, 5))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.3, 0.7)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle will indicate the mean value of the column
    if bins:  # histogram, with an explicit bin count when one is given
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# Visualize histogram & boxplot for number of employees.
histogram_boxplot(data=data2, feature='no_of_employees')
The distribution of number of employees is extremely right skewed.
# Visualize histogram & boxplot for years since established.
histogram_boxplot(data=data2, feature='yrs_since_estab')
The distribution of years since established is asymmetric, bimodal and significantly right skewed.
# Use a countplot to visualize the distribution of region of employment.
plt.title('Countplot: Region of Employment')
sns.countplot(data=data2, x='region_of_employment', order=data2['region_of_employment'].value_counts().index)
plt.xlabel("Region of Employment")
plt.ylabel('Count');
The Northeast, South and West regions employ similar numbers of applicants, while fewer applicants are employed in the Midwest. Very few applicants are employed in the Island region.
# Visualize histogram & boxplot for prevailing wage.
histogram_boxplot(data=data2, feature='prevailing_wage')
There is a notably large bucket at the lowest end of the distribution, which is otherwise roughly symmetrical with a slight right skew. However, wage units differ across rows, so interpretations should be made with caution; the large count of low-end values likely represents hourly wages.
# Use a countplot to visualize the distribution of unit of wage.
plt.title('Countplot: Unit of Wage')
sns.countplot(data=data2, x='unit_of_wage', order=['Hour','Week','Month','Year'])
plt.xlabel("Unit of Wage")
plt.ylabel('Count');
Yearly wage is by far the most common unit of wage. The next most common is hourly at a significantly lower count. Weekly and monthly wages are quite rare.
# Use a countplot to visualize the distribution of full time positions.
plt.title('Countplot: Full Time Position')
sns.countplot(data=data2, x='full_time_position')
plt.xlabel("Full Time Position")
plt.ylabel('Count');
The large majority of visa applicants work a full time position.
# Use a countplot to visualize the distribution of case status.
plt.title('Countplot: Case Status')
sns.countplot(data=data2, x='case_status')
plt.xlabel("Case Status")
plt.ylabel('Count');
About twice as many visa applications are certified compared to those which are denied.
# Make a new DataFrame and encode case status to 1 for Certified visas and 0 for Denied.
data3 = data2.copy()
data3['case_status'] = data3["case_status"].apply(lambda x: 1 if x == "Certified" else 0)
# Use heatmap to visualize correlation between numeric variables.
plt.figure(figsize=(10, 8))
sns.heatmap(data3.corr(numeric_only=True), annot=True);
There appears to be very little correlation between the numeric variables.
# Use countplot with hue parameter to visualize differences in visa certification status across different regions.
plt.title('Countplot: Visa Status by Region of Employment')
sns.countplot(data=data2, x='region_of_employment', hue='case_status', order=['Midwest','South','Northeast','West','Island'])
plt.xlabel("Region")
plt.ylabel('Count');
# Create a DataFrame to display % of visas certified for each region of employment.
cert = data2[data2['case_status']=='Certified']['region_of_employment'].value_counts().rename('certified').reset_index()
totals = data2['region_of_employment'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'region'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | region | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Midwest | 3253 | 4307 | 75.528 |
| 1 | South | 4913 | 7017 | 70.016 |
| 2 | Northeast | 4526 | 7195 | 62.905 |
| 3 | West | 4100 | 6586 | 62.253 |
| 4 | Island | 226 | 375 | 60.267 |
Applicants employed in the Midwest region are most likely to have their visa certified. Those employed in the island region are certified at the lowest rate.
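The certified/total/percentage pattern above is repeated for several columns below; it can be collapsed into one helper built on `pd.crosstab` with `normalize="index"`. The function name and toy frame here are illustrative, not from the notebook:

```python
import pandas as pd

def cert_rate(df: pd.DataFrame, by: str, status_col: str = "case_status") -> pd.Series:
    """% of rows Certified within each level of `by`, sorted descending."""
    rates = pd.crosstab(df[by], df[status_col], normalize="index") * 100
    return rates["Certified"].sort_values(ascending=False)

toy = pd.DataFrame({
    "region": ["Midwest", "Midwest", "South", "South", "South", "Island"],
    "case_status": ["Certified", "Certified", "Certified", "Denied",
                    "Certified", "Denied"],
})
print(cert_rate(toy, "region").round(1))
```

On the real data, `cert_rate(data2, 'region_of_employment')` should reproduce the pct_cert column above.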
# Use countplot with hue parameter to visualize differences in visa certification status based on job training requirements.
plt.title('Countplot: Visa Status by Job Training Requirement')
sns.countplot(data=data2, x='requires_job_training', hue='case_status')
plt.xlabel("Requires Job Training")
plt.ylabel('Count');
The rate of visa certification appears similar for those requiring job training vs. those who do not require job training.
# Create a DataFrame to display % of visas certified based on job training requirement.
cert = data2[data2['case_status']=='Certified']['requires_job_training'].value_counts().rename('certified').reset_index()
totals = data2['requires_job_training'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'training required'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | training required | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Y | 2006 | 2955 | 67.885 |
| 1 | N | 15012 | 22525 | 66.646 |
The DataFrame confirms that the rate of visa certification is similar for those who require job training vs. those who do not.
# Use countplot with hue parameter to visualize differences in visa certification status for those with and without full time positions.
plt.title('Countplot: Visa Status by Full Time Position')
sns.countplot(data=data2, x='full_time_position', hue='case_status')
plt.xlabel("Full Time Position")
plt.ylabel('Count');
# Create a DataFrame to display % of visas certified based on full time position.
cert = data2[data2['case_status']=='Certified']['full_time_position'].value_counts().rename('certified').reset_index()
totals = data2['full_time_position'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'full time'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | full time | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | N | 1855 | 2707 | 68.526 |
| 1 | Y | 15163 | 22773 | 66.583 |
The likelihood of visa certification does not differ greatly for those with full time employment vs. without.
# Visualize boxplot of years since established based on case status.
plt.figure(figsize=(18, 5))
plt.title('Boxplot: Visa Status by Years Since Established')
sns.boxplot(data=data2, x='yrs_since_estab', y='case_status')
plt.xlabel("Years Since Established")
plt.ylabel("Case Status");
The distributions by case status for years since established appear nearly identical, indicating that the number of years since the applicant's employer company was established likely has very little impact on visa certification. This aligns with the heatmap showing low correlation between case status and years since established.
# Visualize boxplot of number of employees based on case status.
plt.figure(figsize=(18, 5))
plt.title('Boxplot: Visa Status by Number of Employees')
sns.boxplot(data=data2, x='no_of_employees', y='case_status')
plt.xlabel("Number of Employees")
plt.ylabel("Case Status");
The distributions appear to be similar although it is difficult to visualize the distribution given the significant number of outliers. A similar distribution would be expected given the heatmap showing low correlation between case status and number of employees.
Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?
# Use countplot with hue parameter to visualize differences in visa certification status across different education levels.
plt.title('Countplot: Visa Status by Education Level')
sns.countplot(data=data2, x='education_of_employee', hue='case_status', order=['High School',"Bachelor's","Master's","Doctorate"])
plt.legend(loc='upper left')
plt.xlabel("Employee Education Level")
plt.ylabel('Count');
It appears that people with a higher level of education are more likely to have their visa certified compared to those with a lower level of education.
# Create a DataFrame to display % of visas certified for each education level.
cert = data2[data2['case_status']=='Certified']['education_of_employee'].value_counts().rename('certified').reset_index()
totals = data2['education_of_employee'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'education level'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | education level | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Doctorate | 1912 | 2192 | 87.226 |
| 1 | Master's | 7575 | 9634 | 78.628 |
| 2 | Bachelor's | 6367 | 10234 | 62.214 |
| 3 | High School | 1164 | 3420 | 34.035 |
The DataFrame confirms that applicants are more likely to have their visa certified with a higher education level.
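The visual trend can be backed with a chi-square test of independence between education level and case status. This test is not part of the original analysis; the sketch below rebuilds the contingency table from the counts above (denied = total minus certified) using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Doctorate, Master's, Bachelor's, High School; columns: certified, denied.
observed = np.array([
    [1912, 280],
    [7575, 2059],
    [6367, 3867],
    [1164, 2256],
])
chi2, p, dof, _ = chi2_contingency(observed)
print(dof, p < 0.05)  # 3 degrees of freedom; the association is significant
```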
How does the visa status vary across different continents?
# Use countplot with hue parameter to visualize differences in visa certification status across different continents.
plt.figure(figsize=(15, 5))
plt.title('Countplot: Visa Status by Continent')
sns.countplot(data=data2, x='continent', hue='case_status', order=['Europe','Africa','Asia','Oceania','North America','South America'])
plt.xlabel("Continent")
plt.ylabel('Count');
# Create a DataFrame to display % of visas certified for each continent.
cert = data2[data2['case_status']=='Certified']['continent'].value_counts().rename('certified').reset_index()
totals = data2['continent'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'continent'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | continent | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Europe | 2957 | 3732 | 79.234 |
| 1 | Africa | 397 | 551 | 72.051 |
| 2 | Asia | 11012 | 16861 | 65.310 |
| 3 | Oceania | 122 | 192 | 63.542 |
| 4 | North America | 2037 | 3292 | 61.877 |
| 5 | South America | 493 | 852 | 57.864 |
Applicants from Europe are the most likely to have their visa application certified, followed by those from Africa then Asia. Applicants from South America are the least likely to have their visa certified.
Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?
# Use countplot with hue parameter to visualize differences in visa certification status for those with and without work experience.
plt.title('Countplot: Visa Status by Work Experience')
sns.countplot(data=data2, x='has_job_experience', hue='case_status')
plt.xlabel("Work Experience")
plt.ylabel('Count');
It appears that people with work experience are much more likely to have their visa certified compared to those without work experience.
# Create a DataFrame to display % of visas certified for people with work experience vs. those without work experience.
cert = data2[data2['case_status']=='Certified']['has_job_experience'].value_counts().rename('certified').reset_index()
totals = data2['has_job_experience'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'job experience'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | job experience | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Y | 11024 | 14802 | 74.476 |
| 1 | N | 5994 | 10678 | 56.134 |
The DataFrame confirms that applicants with job experience are more likely to have their visa certified compared to those without job experience.
In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?
# Use countplot with hue parameter to visualize differences in visa certification status across different wage payment intervals.
plt.title('Countplot: Visa Status by Unit of Wage')
sns.countplot(data=data2, x='unit_of_wage', hue='case_status', order=['Hour','Week','Month','Year'])
plt.xlabel("Unit of Wage")
plt.ylabel('Count');
Applicants with a yearly wage are more likely to have their visa certified, while those with an hourly wage are more likely to have their visa application denied.
# Create a DataFrame to display % of visas certified across different wage payment intervals.
cert = data2[data2['case_status']=='Certified']['unit_of_wage'].value_counts().rename('certified').reset_index()
totals = data2['unit_of_wage'].value_counts().rename('total').reset_index()
cert_pct = pd.merge(cert,totals)
cert_pct.rename(columns={'index':'unit of wage'}, inplace=True)
cert_pct['pct_cert'] = cert_pct['certified']/cert_pct['total']*100
cert_pct = cert_pct.sort_values(by='pct_cert', ascending=False).reset_index(drop=True)
cert_pct
| | unit of wage | certified | total | pct_cert |
|---|---|---|---|---|
| 0 | Year | 16047 | 22962 | 69.885 |
| 1 | Week | 169 | 272 | 62.132 |
| 2 | Month | 55 | 89 | 61.798 |
| 3 | Hour | 747 | 2157 | 34.631 |
The DataFrame confirms that applicants with a yearly wage are most likely to have their visa certified, followed by those with weekly and monthly wages. Applicants with hourly wages are about half as likely to have their visa certified compared to those with a yearly wage.
The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?
# Visualize boxplot of prevailing wage based on case status.
plt.figure(figsize=(18, 5))
plt.title('Boxplot: Visa Status by Prevailing Wage')
sns.boxplot(data=data2, x='prevailing_wage', y='case_status')
plt.xlabel("Prevailing Wage")
plt.ylabel("Case Status");
The median prevailing wage appears to be relatively similar for applicants who are certified vs. those who are denied, with a slightly higher wage for those who are certified (keeping in mind that hourly wages could be bringing down the median for denied cases).
# Check for null values in the data.
data2.isnull().sum()
continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
yrs_since_estab          0
dtype: int64
There are no missing values to be treated.
# Use boxplot to visualize outliers.
numeric_columns = data2.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data2[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
There are significant outliers in the numeric columns; however, these are legitimate values and therefore will not be treated.
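For reference, the 1.5×IQR rule behind the boxplot whiskers (the `whis=1.5` argument above) can be written out explicitly; a sketch on a toy array, not the project data:

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Boolean mask marking points beyond the whis*IQR boxplot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

x = np.array([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(x)
print(x[mask])  # only the extreme value 100 is flagged
```

Counting `mask.sum()` per column gives a quick tally of how many points each boxplot is flagging.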
# Separate the x & y variables and split each into training and testing sets.
X = data3.drop(['case_status'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = data3['case_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
(17836, 21) (7644, 21)
# Print the class distribution of the training and test sets.
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set:
1 0.668
0 0.332
Name: case_status, dtype: float64
Percentage of classes in test set:
1 0.668
0 0.332
Name: case_status, dtype: float64
The distribution of certified vs. denied case status has been preserved in the training and testing data.
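It is the `stratify=y` argument that preserves this ratio; without it the class proportions in the two splits can drift apart. A quick sketch on a toy target vector with the same 2:1 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with roughly the same 2:1 certified/denied ratio as the project data.
y = np.array([1] * 668 + [0] * 332)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Both splits keep the positive-class share at about 0.668.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```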
# Create a user-defined function to display a DataFrame of model performance metrics.
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute accuracy
recall = recall_score(target, pred) # to compute recall
precision = precision_score(target, pred) # to compute precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# Create a user-defined function to display a confusion matrix for classification models.
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Fit the initial decision tree model.
dtree = DecisionTreeClassifier(criterion='gini',random_state=1)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Display the confusion matrix for the initial decision tree on the training data.
confusion_matrix_sklearn(dtree, X_train, y_train)
# Display the DataFrame of performance measures for the initial decision tree on the training data.
dtree_model_train_perf=model_performance_classification_sklearn(dtree, X_train, y_train)
print("Training performance \n",dtree_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
As expected, the initial tree severely overfits, achieving 100% accuracy, recall, and precision on the training data.
# Display the confusion matrix for the initial decision tree on the test data.
confusion_matrix_sklearn(dtree, X_test, y_test)
# Display the DataFrame of performance measures for the initial decision tree on the test data.
dtree_model_test_perf=model_performance_classification_sklearn(dtree, X_test, y_test)
print("Test performance \n",dtree_model_test_perf)
Test performance
Accuracy Recall Precision F1
0 0.658 0.738 0.747 0.742
Because it severely overfits the training data, the initial decision tree model fails to generalize to the test data.
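One standard remedy for this kind of overfitting, besides the grid search used later, is cost-complexity pruning via `ccp_alpha`. A sketch on synthetic data (the alpha value is chosen purely for illustration, not tuned for the project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

unpruned = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_tr, y_tr)

# The pruned tree is far smaller and gives up its perfect training fit,
# which typically narrows the gap between training and test performance.
print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```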
# Fit the initial bagging classifier model.
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
# Display the confusion matrix for the initial bagging classifier model on the training data.
confusion_matrix_sklearn(bagging, X_train, y_train)
# Display the DataFrame of performance measures for the initial bagging classifier model on the training data.
bagging_model_train_perf=model_performance_classification_sklearn(bagging, X_train, y_train)
print("Training performance \n",bagging_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.985 0.986 0.992 0.989
The model is performing well on the training data but it may be overfitting.
# Display the confusion matrix for the initial bagging classifier model on the test data.
confusion_matrix_sklearn(bagging, X_test, y_test)
# Display the DataFrame of performance measures for the initial bagging classifier model on the test data.
bagging_model_test_perf=model_performance_classification_sklearn(bagging, X_test, y_test)
print("Testing performance \n",bagging_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.695 0.768 0.774 0.771
The initial bagging classifier model is significantly overfitting the training data. It performs slightly better than the initial decision tree model on all measures.
# Fit the initial random forest model.
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
# Display the confusion matrix for the initial random forest model on the training data.
confusion_matrix_sklearn(rf,X_train,y_train)
# Display the DataFrame of performance measures for the initial random forest model on the training data.
rf_model_train_perf=model_performance_classification_sklearn(rf,X_train,y_train)
print("Training performance \n",rf_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
The initial random forest model severely overfits, achieving 100% accuracy, recall, and precision on the training data.
# Display the confusion matrix for the initial random forest model on the test data.
confusion_matrix_sklearn(rf,X_test,y_test)
# Display the DataFrame of performance measures for the initial random forest model on the test data.
rf_model_test_perf=model_performance_classification_sklearn(rf,X_test,y_test)
print("Test performance \n",rf_model_test_perf)
Test performance
Accuracy Recall Precision F1
0 0.721 0.834 0.768 0.799
Although the random forest model is severely overfitting the training data, it is performing better than the bagging classifier on the test data for 3 of 4 measures with a notable improvement in recall. Precision is slightly reduced.
# Use GridSearch to perform hyperparameter tuning on the decision tree model.
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': [5, 10, 15],
'min_samples_leaf': [2, 5, 7],
'max_leaf_nodes' : [2, 3, 5],
'min_impurity_decrease': [0.0001,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=2, min_impurity_decrease=0.1,
min_samples_leaf=2, random_state=1)
# Display the confusion matrix for the tuned decision tree on the training data.
confusion_matrix_sklearn(dtree_estimator, X_train, y_train)
# Display the DataFrame of performance measures for the tuned decision tree on the training data.
dtree_estimator_model_train_perf=model_performance_classification_sklearn(dtree_estimator, X_train,y_train)
print("Training performance \n",dtree_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.668 1.000 0.668 0.801
The tuned decision tree appears to have achieved 100% recall simply by labeling every case as certified, which will not be a practical strategy for our final model.
# Display the confusion matrix for the tuned decision tree on the test data.
confusion_matrix_sklearn(dtree_estimator, X_test,y_test)
# Display the DataFrame of performance measures for the tuned decision tree on the test data.
dtree_estimator_model_test_perf=model_performance_classification_sklearn(dtree_estimator, X_test, y_test)
print("Testing performance \n",dtree_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.668 1.000 0.668 0.801
Again the tuned decision tree has predicted that all cases will be certified, resulting in 100% recall but poor accuracy and precision. This model will not serve the desired purpose.
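A common way to guard against this degenerate all-positive solution is to tune on a metric that penalizes false positives as well, such as F1. A sketch on synthetic imbalanced data (the grid values are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data, roughly the 2:1 split seen in case_status.
X, y = make_classification(n_samples=1500, weights=[0.33], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {'max_depth': [3, 5, 7]},
    scoring='f1',  # rewards recall only when precision holds up too
    cv=5,
).fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because F1 is the harmonic mean of precision and recall, a classifier that labels everything positive no longer scores perfectly and the search must find genuinely discriminative parameters.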
# Use GridSearch to tune the bagging classifier model.
cl1 = DecisionTreeClassifier(random_state=1)
param_grid = {'base_estimator':[cl1],
'n_estimators':[15,51,101],
'max_features': [0.7,0.9,1]
}
grid = GridSearchCV(BaggingClassifier(random_state=1,bootstrap=True), param_grid=param_grid, scoring = 'recall', cv = 5)
grid.fit(X_train, y_train)
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
max_features=1, n_estimators=101, random_state=1)
# Display the confusion matrix for the tuned bagging classifier on the train data.
confusion_matrix_sklearn(bagging_estimator, X_train,y_train)
# Display the DataFrame of performance measures for the tuned bagging classifier on the training data.
bagging_estimator_model_train_perf=model_performance_classification_sklearn(bagging_estimator, X_train,y_train)
print("Training performance \n",bagging_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.668 1.000 0.668 0.801
Like the tuned decision tree, the tuned bagging classifier classifies every case as certified, achieving 100% recall at the expense of accuracy and precision.
# Display the confusion matrix for the tuned bagging classifier on the test data.
confusion_matrix_sklearn(bagging_estimator, X_test,y_test)
# Display the DataFrame of performance measures for the tuned bagging classifier on the test data.
bagging_estimator_model_test_perf=model_performance_classification_sklearn(bagging_estimator, X_test, y_test)
print("Testing performance \n",bagging_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.668 1.000 0.668 0.801
The tuned bagging classifier model again classifies all cases as certified, thereby achieving 100% recall but with poor performance on accuracy and precision measures.
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [25, 75, 100],
"min_samples_leaf": np.arange(1, 4),
"max_features": [0.7,'log2','auto'],
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features='log2', max_samples=0.7, min_samples_leaf=3,
random_state=1)
# Display the confusion matrix for the tuned random forest model on the training data.
confusion_matrix_sklearn(rf_estimator, X_train,y_train)
# Display the DataFrame of performance measures for the tuned random forest on the training data.
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator, X_train,y_train)
print("Training performance \n",rf_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.823 0.928 0.828 0.875
The tuned random forest is performing well across all measures seemingly without overfitting to the training data.
# Display the confusion matrix for the tuned random forest model on the test data.
confusion_matrix_sklearn(rf_estimator, X_test,y_test)
# Display the DataFrame of performance measures for the tuned random forest on the test data.
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator, X_test, y_test)
print("Testing performance \n",rf_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.740 0.871 0.770 0.817
The tuned random forest is performing slightly better than the initial random forest model across all measures, with a notable improvement in recall. This is the best performing model so far.
# Fit the initial AdaBoost model.
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
# Display the confusion matrix for the initial AdaBoost model on the training data.
confusion_matrix_sklearn(ab_classifier,X_train,y_train)
# Display the DataFrame of performance measures for the initial AdaBoost model on the training data.
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print(ab_classifier_model_train_perf)
Accuracy Recall Precision F1
0 0.738 0.887 0.760 0.819
Performance of the initial AdaBoost model on the training data is comparable to the performance of the tuned random forest on the test data.
# Display the confusion matrix for the initial AdaBoost model on the test data.
confusion_matrix_sklearn(ab_classifier,X_test,y_test)
# Display the DataFrame of performance measures for the initial AdaBoost model on the test data.
ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test)
print(ab_classifier_model_test_perf)
Accuracy Recall Precision F1
0 0.733 0.885 0.757 0.816
The performance of the AdaBoost model is very similar to the performance of the tuned random forest across all measures, without overfitting to the training data.
#Fit the initial GradientBoost model.
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
# Display the confusion matrix for the initial GradientBoost model on the training data.
confusion_matrix_sklearn(gb_classifier,X_train,y_train)
# Display the DataFrame of performance measures for the initial GradientBoost model on the training data.
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
Training performance:
Accuracy Recall Precision F1
0 0.759 0.883 0.784 0.831
The initial GradientBoost model has slightly better performance on 3 of 4 measures on the training data compared to the AdaBoost model (recall is negligibly lower).
# Display the confusion matrix for the initial GradientBoost model on the test data.
confusion_matrix_sklearn(gb_classifier,X_test,y_test)
# Display the DataFrame of performance measures for the initial GradientBoost model on the test data.
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)
Testing performance:
Accuracy Recall Precision F1
0 0.745 0.873 0.774 0.820
This appears to be the best model performance on the test data so far, having the best performance on all 4 measures without overfitting to the training data.
# Use GridSearch to perform hyperparameter tuning on the AdaBoost model.
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2)],
"n_estimators": np.arange(40,110,20),
"learning_rate":np.arange(0.1,0.7,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
learning_rate=0.1, n_estimators=40, random_state=1)
# Display the confusion matrix for the tuned AdaBoost model on the training data.
confusion_matrix_sklearn(abc_tuned,X_train,y_train)
# Display the DataFrame of performance measures for the tuned AdaBoost model on the training data.
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print(abc_tuned_model_train_perf)
Accuracy Recall Precision F1
0 0.718 0.936 0.723 0.816
This model has improved recall but lower accuracy and precision compared to the initial GradientBoost model.
# Display the confusion matrix for the tuned AdaBoost model on the test data.
confusion_matrix_sklearn(abc_tuned,X_test,y_test)
# Display the DataFrame of performance measures for the tuned AdaBoost model on the test data.
abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test)
print(abc_tuned_model_test_perf)
Accuracy Recall Precision F1
0 0.711 0.936 0.718 0.812
The tuned AdaBoost model is not showing overfitting as performance measures are very similar across training and testing data. However the initial GradientBoost model still seems to be the best model overall as it achieves better balance across the performance measures.
# Use GridSearch to perform hyperparameter tuning on the GradientBoost model.
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [150,200,250],
"subsample":[0.8,1],
"max_features":[0.7,0.9],
"learning_rate":np.arange(0.1,0.7,0.2)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=200, random_state=1,
subsample=1)
# Display the confusion matrix for the tuned GradientBoost model on the training data.
confusion_matrix_sklearn(gbc_tuned,X_train,y_train)
# Display the DataFrame of performance measures for the tuned GradientBoost model on the training data.
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
Training performance:
Accuracy Recall Precision F1
0 0.765 0.886 0.789 0.834
The tuned GradientBoost model has slightly improved performance on the training data for all measures compared to the initial GradientBoost model.
# Display the confusion matrix for the tuned GradientBoost model on the test data.
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)
# Display the DataFrame of performance measures for the tuned GradientBoost model on the test data.
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)
Testing performance:
Accuracy Recall Precision F1
0 0.746 0.874 0.774 0.821
On the testing data there is a negligible improvement in performance measures compared to the initial GradientBoost model.
estimators = [('Random Forest',rf_estimator), ('Gradient Boosting',gbc_tuned), ('AdaBoost',ab_classifier)]
final_estimator = XGBClassifier(random_state=1, eval_metric='logloss')
stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(max_features='log2',
max_samples=0.7,
min_samples_leaf=3,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9,
n_estimators=200,
random_state=1,
subsample=1)),
('AdaBoost',
AdaBoostClassifier(random_state=1))],
final_estimator=XGBClassifier(eval_metric='logloss',
random_state=1))
# Display the confusion matrix for the stacking model on the training data.
confusion_matrix_sklearn(stacking_classifier,X_train,y_train)
# Display the DataFrame of performance measures for the stacking model on the training data.
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
Training performance:
Accuracy Recall Precision F1
0 0.788 0.900 0.806 0.850
The stacking model has improved performance on all measures on the training data compared to the tuned GradientBoost model.
# Display the confusion matrix for the stacking model on the test data.
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
# Display the DataFrame of performance measures for the stacking model on the test data.
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)
Testing performance:
Accuracy Recall Precision F1
0 0.742 0.870 0.772 0.818
It appears that the improved performance compared to the tuned GradientBoost model was due to overfitting to the training data; on the test data, the stacking model is performing slightly below the tuned GradientBoost model on all measures.
# Compare the performance of all models on the training data.
models_train_comp_df = pd.concat(
[dtree_model_train_perf.T,dtree_estimator_model_train_perf.T,rf_model_train_perf.T,rf_estimator_model_train_perf.T,
bagging_model_train_perf.T,bagging_estimator_model_train_perf.T,ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,gb_classifier_model_train_perf.T,gbc_tuned_model_train_perf.T,stacking_classifier_model_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Estimator Tuned",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"Stacking Classifier"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree | Decision Tree Estimator | Random Forest | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.000 | 0.668 | 1.000 | 0.823 | 0.985 | 0.668 | 0.738 | 0.718 | 0.759 | 0.765 | 0.788 |
| Recall | 1.000 | 1.000 | 1.000 | 0.928 | 0.986 | 1.000 | 0.887 | 0.936 | 0.883 | 0.886 | 0.900 |
| Precision | 1.000 | 0.668 | 1.000 | 0.828 | 0.992 | 0.668 | 0.760 | 0.723 | 0.784 | 0.789 | 0.806 |
| F1 | 1.000 | 0.801 | 1.000 | 0.875 | 0.989 | 0.801 | 0.819 | 0.816 | 0.831 | 0.834 | 0.850 |
# Compare the performance of all models on the test data.
models_test_comp_df = pd.concat(
[dtree_model_test_perf.T,dtree_estimator_model_test_perf.T,rf_model_test_perf.T,rf_estimator_model_test_perf.T,
bagging_model_test_perf.T,bagging_estimator_model_test_perf.T,ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,gb_classifier_model_test_perf.T,gbc_tuned_model_test_perf.T,stacking_classifier_model_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Estimator Tuned",
"Adaboost Classifier",
"Adaboost Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"Stacking Classifier"]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Decision Tree | Decision Tree Estimator | Random Forest | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned | Adaboost Classifier | Adaboost Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | Stacking Classifier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.658 | 0.668 | 0.721 | 0.740 | 0.695 | 0.668 | 0.733 | 0.711 | 0.745 | 0.746 | 0.742 |
| Recall | 0.738 | 1.000 | 0.834 | 0.871 | 0.768 | 1.000 | 0.885 | 0.936 | 0.873 | 0.874 | 0.870 |
| Precision | 0.747 | 0.668 | 0.768 | 0.770 | 0.774 | 0.668 | 0.757 | 0.718 | 0.774 | 0.774 | 0.772 |
| F1 | 0.742 | 0.801 | 0.799 | 0.817 | 0.771 | 0.801 | 0.816 | 0.812 | 0.820 | 0.821 | 0.818 |
# Visualize the feature importance for the final model.
feature_names = X_train.columns
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
High school education level is by far the most important feature. This aligns with the earlier analysis, which showed that applicants with only a high school education are more likely to have their visa denied than certified, while applicants with higher levels of education are more likely to be certified. Job experience is the second most important feature, followed by prevailing wage, Master's degree, and Doctorate.
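Impurity-based importances from tree ensembles can be biased toward high-cardinality features, so permutation importance on held-out data is a useful cross-check. A sketch on synthetic data (the project's `gbc_tuned` and `X_test` would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)

# Mean drop in test-set score when each feature is shuffled in turn.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=1)
print(result.importances_mean.round(3))
```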
Based on the exploratory data analysis and machine learning models, certified applicants tend to have the following features: a higher level of education (Bachelor's, Master's, or Doctorate), prior job experience, and a yearly rather than hourly unit of wage.
The company should use the tuned GradientBoost model to identify applicants with the greatest odds of visa certification.
Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
“ReneWind” is a company working on improving the machinery and processes involved in wind energy production using machine learning. It has collected sensor data on generator failures of wind turbines and shared a ciphered version of it, as the data collected through sensors is confidential (the type of data collected varies by company). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generator could be repaired before failing/breaking to reduce the maintenance cost. The different costs associated with maintenance are as follows:
- Replacement cost = $40,000
- Repair cost = $15,000
- Inspection cost = $5,000

A “1” in the target variable represents “failure” and a “0” represents “no failure”.
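These costs can be wired directly into model evaluation: a missed failure (false negative) incurs a replacement, a caught failure (true positive) a repair, and a false alarm (false positive) an inspection. A sketch of such a cost function (the function name and structure are illustrative, not from the project code):

```python
def maintenance_cost(y_true, y_pred,
                     replacement=40_000, repair=15_000, inspection=5_000):
    """Total maintenance cost implied by a set of failure predictions.

    1 = failure, 0 = no failure.
    - Missed failure (FN): generator breaks  -> replacement cost.
    - Caught failure (TP): repaired in time  -> repair cost.
    - False alarm (FP):    inspection only   -> inspection cost.
    """
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return fn * replacement + tp * repair + fp * inspection

print(maintenance_cost([1, 1, 0, 0], [1, 0, 1, 0]))  # 15000 + 40000 + 5000
```

Because a false negative costs more than twice a true positive plus a false positive combined, minimizing this cost pushes model selection toward high recall, which motivates the recall-oriented tuning used below.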
# Import libraries for data manipulation.
import pandas as pd
import numpy as np
# Import libraries for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Import library to split data.
from sklearn.model_selection import train_test_split
# Import SimpleImputer to deal with missing values.
from sklearn.impute import SimpleImputer
# Import functions to evaluate model performance.
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
# Import library to build a logistic regression model.
from sklearn.linear_model import LogisticRegression
# Import library to build a decision tree model.
from sklearn.tree import DecisionTreeClassifier
# Import ensemble methods.
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Import libraries for k-fold cross validation.
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Import libraries to oversample and undersample data.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Import RandomizedSearchCV for model tuning.
from sklearn.model_selection import RandomizedSearchCV
# Import libraries to create and modify pipelines.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Import libraries to ignore irrelevant warnings.
import warnings
warnings.filterwarnings('ignore')
# Set the precision of floating numbers to 3 decimal points.
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the data sets.
train = pd.read_csv('/content/drive/My Drive/Data science/Data sets/Train.csv')
test = pd.read_csv('/content/drive/My Drive/Data science/Data sets/Test.csv')
# Create copies of the data sets to avoid altering the original data.
train0 = train.copy()
test0 = test.copy()
# Use print function and .shape attribute to display the number of rows & columns in each data set.
print('There are',train0.shape[0],'rows and',train0.shape[1],'columns in the training data set.')
print('There are',test0.shape[0],'rows and',test0.shape[1],'columns in the testing data set.')
There are 40000 rows and 41 columns in the training data set.
There are 10000 rows and 41 columns in the testing data set.
# Display a sample of 5 rows from each data set to get a general idea of the information & make sure they are loaded properly.
train0.sample(5)
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24482 | -1.893 | 1.124 | 2.268 | -0.685 | 2.284 | -1.943 | -1.091 | 1.678 | -1.320 | -0.787 | ... | 4.395 | -0.075 | -2.686 | 1.269 | 3.710 | 0.737 | -1.864 | 0.879 | 0.911 | 0 |
| 11743 | 2.328 | -3.935 | 10.720 | -2.265 | -4.851 | -3.617 | -2.130 | -0.336 | 0.192 | 2.799 | ... | -6.279 | -2.592 | 3.408 | 7.088 | 7.512 | -0.531 | -6.752 | 3.495 | -3.322 | 0 |
| 8692 | 0.952 | 1.828 | 0.627 | -4.590 | 0.951 | 0.336 | -0.548 | 1.471 | 0.801 | -5.211 | ... | -1.953 | 2.536 | -6.538 | -0.152 | 0.508 | 2.678 | -3.713 | -0.282 | 5.223 | 0 |
| 38184 | -4.174 | -2.618 | -0.984 | 0.475 | 0.245 | -2.189 | -2.336 | 1.323 | -0.130 | -0.161 | ... | 4.279 | -1.642 | 0.122 | 1.342 | 2.632 | 2.270 | -2.576 | 1.683 | -2.760 | 0 |
| 18288 | -4.586 | 1.305 | 2.605 | 2.192 | 3.607 | -2.984 | -3.190 | 3.248 | -1.544 | -1.143 | ... | 14.279 | 5.986 | -5.972 | 5.787 | 1.996 | -1.161 | -5.060 | -0.676 | -1.772 | 0 |
5 rows × 41 columns
test0.sample(5)
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2487 | 6.952 | 1.993 | 4.852 | -6.759 | -1.345 | 2.574 | 1.228 | -5.555 | -0.032 | 2.201 | ... | -6.674 | 1.496 | -2.200 | 2.578 | 3.283 | -2.056 | 4.630 | 1.841 | 1.636 | 0 |
| 6885 | -1.548 | -0.392 | 5.059 | 1.832 | -1.974 | -1.070 | -1.738 | -3.859 | 0.513 | 1.291 | ... | -6.186 | -2.628 | 4.300 | 3.235 | 1.953 | 0.701 | 3.427 | 2.833 | -3.052 | 0 |
| 9453 | 1.397 | 1.853 | 5.400 | -0.087 | 0.341 | -5.516 | 1.166 | 5.873 | -2.683 | -1.162 | ... | 0.375 | -6.597 | 1.013 | 2.064 | 4.931 | 1.280 | -4.513 | 0.509 | 1.979 | 0 |
| 7733 | 0.049 | -0.642 | 3.362 | 0.730 | -0.260 | -1.343 | -1.820 | -1.911 | 2.322 | 0.751 | ... | -0.444 | -1.977 | -0.073 | 0.699 | 2.956 | -0.387 | -2.852 | 1.713 | -2.355 | 0 |
| 563 | -4.556 | -3.610 | -1.024 | -1.337 | -0.866 | 0.358 | -1.687 | -2.053 | 0.124 | 1.109 | ... | -3.390 | -1.942 | 3.067 | -1.668 | 3.256 | 2.864 | 1.409 | 3.487 | -2.080 | 0 |
5 rows × 41 columns
The data sets appear to be loaded properly.
# Use .info() method to check the non-null counts and data types of each column in the training set.
train0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      39954 non-null  float64
 1   V2      39961 non-null  float64
 2   V3      40000 non-null  float64
 3   V4      40000 non-null  float64
 4   V5      40000 non-null  float64
 5   V6      40000 non-null  float64
 6   V7      40000 non-null  float64
 7   V8      40000 non-null  float64
 8   V9      40000 non-null  float64
 9   V10     40000 non-null  float64
 10  V11     40000 non-null  float64
 11  V12     40000 non-null  float64
 12  V13     40000 non-null  float64
 13  V14     40000 non-null  float64
 14  V15     40000 non-null  float64
 15  V16     40000 non-null  float64
 16  V17     40000 non-null  float64
 17  V18     40000 non-null  float64
 18  V19     40000 non-null  float64
 19  V20     40000 non-null  float64
 20  V21     40000 non-null  float64
 21  V22     40000 non-null  float64
 22  V23     40000 non-null  float64
 23  V24     40000 non-null  float64
 24  V25     40000 non-null  float64
 25  V26     40000 non-null  float64
 26  V27     40000 non-null  float64
 27  V28     40000 non-null  float64
 28  V29     40000 non-null  float64
 29  V30     40000 non-null  float64
 30  V31     40000 non-null  float64
 31  V32     40000 non-null  float64
 32  V33     40000 non-null  float64
 33  V34     40000 non-null  float64
 34  V35     40000 non-null  float64
 35  V36     40000 non-null  float64
 36  V37     40000 non-null  float64
 37  V38     40000 non-null  float64
 38  V39     40000 non-null  float64
 39  V40     40000 non-null  float64
 40  Target  40000 non-null  int64
dtypes: float64(40), int64(1)
memory usage: 12.5 MB
Columns V1 and V2 have missing (null) values. All of the columns are of numeric data types.
# Use .info() method to check the non-null counts and data types of each column in the test set.
test0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      9989 non-null   float64
 1   V2      9993 non-null   float64
 2   V3      10000 non-null  float64
 3   V4      10000 non-null  float64
 4   V5      10000 non-null  float64
 5   V6      10000 non-null  float64
 6   V7      10000 non-null  float64
 7   V8      10000 non-null  float64
 8   V9      10000 non-null  float64
 9   V10     10000 non-null  float64
 10  V11     10000 non-null  float64
 11  V12     10000 non-null  float64
 12  V13     10000 non-null  float64
 13  V14     10000 non-null  float64
 14  V15     10000 non-null  float64
 15  V16     10000 non-null  float64
 16  V17     10000 non-null  float64
 17  V18     10000 non-null  float64
 18  V19     10000 non-null  float64
 19  V20     10000 non-null  float64
 20  V21     10000 non-null  float64
 21  V22     10000 non-null  float64
 22  V23     10000 non-null  float64
 23  V24     10000 non-null  float64
 24  V25     10000 non-null  float64
 25  V26     10000 non-null  float64
 26  V27     10000 non-null  float64
 27  V28     10000 non-null  float64
 28  V29     10000 non-null  float64
 29  V30     10000 non-null  float64
 30  V31     10000 non-null  float64
 31  V32     10000 non-null  float64
 32  V33     10000 non-null  float64
 33  V34     10000 non-null  float64
 34  V35     10000 non-null  float64
 35  V36     10000 non-null  float64
 36  V37     10000 non-null  float64
 37  V38     10000 non-null  float64
 38  V39     10000 non-null  float64
 39  V40     10000 non-null  float64
 40  Target  10000 non-null  int64
dtypes: float64(40), int64(1)
memory usage: 3.1 MB
Columns V1 and V2 have missing (null) values. All of the columns are of numeric data types.
# Check the statistical summary for the training data set.
train0.describe()
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39954.000 | 39961.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | ... | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 | 40000.000 |
| mean | -0.288 | 0.443 | 2.506 | -0.066 | -0.045 | -1.001 | -0.893 | -0.563 | -0.008 | -0.002 | ... | 0.327 | 0.057 | -0.464 | 2.235 | 1.530 | -0.000 | -0.351 | 0.900 | -0.897 | 0.055 |
| std | 3.449 | 3.139 | 3.406 | 3.437 | 2.107 | 2.037 | 1.757 | 3.299 | 2.162 | 2.183 | ... | 5.499 | 3.574 | 3.186 | 2.924 | 3.820 | 1.778 | 3.964 | 1.751 | 2.998 | 0.227 |
| min | -13.502 | -13.212 | -11.469 | -16.015 | -8.613 | -10.227 | -8.206 | -15.658 | -8.596 | -11.001 | ... | -23.201 | -17.454 | -17.985 | -15.350 | -17.479 | -7.640 | -17.375 | -7.136 | -11.930 | 0.000 |
| 25% | -2.751 | -1.638 | 0.203 | -2.350 | -1.507 | -2.363 | -2.037 | -2.660 | -1.494 | -1.391 | ... | -3.392 | -2.238 | -2.128 | 0.332 | -0.937 | -1.266 | -3.017 | -0.262 | -2.950 | 0.000 |
| 50% | -0.774 | 0.464 | 2.265 | -0.124 | -0.097 | -1.007 | -0.935 | -0.384 | -0.052 | 0.106 | ... | 0.056 | -0.050 | -0.251 | 2.110 | 1.572 | -0.133 | -0.319 | 0.921 | -0.949 | 0.000 |
| 75% | 1.837 | 2.538 | 4.585 | 2.149 | 1.346 | 0.374 | 0.207 | 1.714 | 1.426 | 1.486 | ... | 3.789 | 2.256 | 1.433 | 4.045 | 3.997 | 1.161 | 2.291 | 2.069 | 1.092 | 0.000 |
| max | 17.437 | 13.089 | 18.366 | 13.280 | 9.403 | 7.065 | 8.006 | 11.679 | 8.507 | 8.108 | ... | 24.848 | 16.692 | 14.358 | 16.805 | 19.330 | 7.803 | 15.964 | 7.998 | 10.654 | 1.000 |
8 rows × 41 columns
# Check the statistical summary for the testing data set.
test0.describe()
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9989.000 | 9993.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | ... | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 | 10000.000 |
| mean | -0.260 | 0.417 | 2.555 | -0.054 | -0.085 | -1.014 | -0.908 | -0.599 | 0.026 | 0.019 | ... | 0.253 | 0.008 | -0.423 | 2.258 | 1.550 | -0.006 | -0.373 | 0.920 | -0.937 | 0.055 |
| std | 3.440 | 3.160 | 3.395 | 3.462 | 2.102 | 2.039 | 1.737 | 3.343 | 2.180 | 2.169 | ... | 5.503 | 3.552 | 3.168 | 2.919 | 3.794 | 1.782 | 3.998 | 1.726 | 3.011 | 0.227 |
| min | -12.382 | -11.625 | -12.941 | -14.682 | -7.712 | -8.949 | -8.124 | -12.710 | -7.570 | -8.291 | ... | -20.520 | -14.904 | -17.135 | -19.522 | -14.912 | -5.362 | -15.335 | -7.147 | -10.779 | 0.000 |
| 25% | -2.700 | -1.701 | 0.238 | -2.371 | -1.593 | -2.377 | -2.035 | -2.684 | -1.490 | -1.362 | ... | -3.459 | -2.280 | -2.066 | 0.375 | -0.892 | -1.278 | -3.004 | -0.225 | -2.993 | 0.000 |
| 50% | -0.719 | 0.456 | 2.283 | -0.169 | -0.144 | -1.015 | -0.938 | -0.387 | -0.086 | 0.150 | ... | -0.013 | -0.099 | -0.203 | 2.149 | 1.620 | -0.150 | -0.362 | 0.933 | -0.986 | 0.000 |
| 75% | 1.861 | 2.526 | 4.656 | 2.144 | 1.324 | 0.354 | 0.189 | 1.698 | 1.466 | 1.537 | ... | 3.762 | 2.198 | 1.472 | 4.088 | 4.061 | 1.193 | 2.335 | 2.097 | 1.085 | 0.000 |
| max | 13.504 | 14.079 | 15.409 | 12.896 | 7.673 | 6.273 | 7.616 | 10.792 | 8.851 | 7.691 | ... | 26.539 | 13.324 | 14.581 | 13.489 | 17.116 | 7.682 | 13.726 | 7.234 | 10.392 | 1.000 |
8 rows × 41 columns
Because the features are anonymized, the statistical summary cannot be interpreted in domain terms; however, the means, spreads and ranges look broadly similar across the training and testing sets.
# Check for duplicated values in the training and testing data sets.
print('There are',train0.duplicated().sum(),'duplicated values in the training set.')
print('There are',test0.duplicated().sum(),'duplicated values in the test set.')
There are 0 duplicated values in the training set.
There are 0 duplicated values in the test set.
# Check for missing (null) values in the training data set.
train0.isnull().sum()[train0.isnull().sum()>0]
V1    46
V2    39
dtype: int64
# Check for missing (null) values in the testing data set.
test0.isnull().sum()[test0.isnull().sum()>0]
V1    11
V2     7
dtype: int64
As noted previously using the .info() method, both data sets contain null values in the V1 and V2 columns. These null values will be dealt with after the initial data analysis is complete.
# Define a function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="skyblue"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
sns.set_palette("Set3")
# Plot histograms and boxplots for each of the features in the training data set.
for feature in train0.drop('Target', axis=1).columns:
histogram_boxplot(train0, feature, figsize=(12, 7), kde=False, bins=None)
Due to the confidentiality of the features, it is difficult to interpret the visualizations in domain terms. However, almost all of the features appear to have near-normal distributions, and every feature contains a significant number of outliers.
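Since the features are anonymized, a quick quantitative check complements eyeballing each plot. The sketch below (on synthetic data, with an illustrative `iqr_outlier_counts` helper) shows the 1.5×IQR rule that the boxplot whiskers use to flag outliers:

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return ((df.lt(q1 - 1.5 * iqr)) | (df.gt(q3 + 1.5 * iqr))).sum()

# Toy data: V1 is near-normal; V2 is the same values plus two extreme points.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"V1": rng.normal(size=1000)})
toy["V2"] = toy["V1"].copy()
toy.loc[0, "V2"] = 50.0
toy.loc[1, "V2"] = -50.0

print(iqr_outlier_counts(toy))  # V2 reports at least the two injected outliers
```

Applying the same helper to `train0.drop('Target', axis=1)` would quantify the outlier counts seen in the boxplots.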
# Plot histograms and boxplots for each of the features in the testing data set.
for feature in test0.drop('Target', axis=1).columns:
histogram_boxplot(test0, feature, figsize=(12, 7), kde=False, bins=None)
The distributions appear similar across the training and testing data sets. The test data distributions are somewhat less symmetrical/normal, likely due to the testing data set having fewer observations.
# Visualize the distribution of the target variable in the training data set using countplot.
sns.set_palette("Pastel1")
plt.title('Countplot: Generator Failure (training data)')
sns.countplot(data=train0, x ='Target')
plt.xlabel("Generator Failure")
plt.ylabel('Count');
# Print the percentage of each value for the target variable in the training set.
print("Percentage of classes in training set:")
print(train0.Target.value_counts(normalize=True))
Percentage of classes in training set:
0   0.945
1   0.055
Name: Target, dtype: float64
Generator failure is significantly less common than non-failure, making the data set highly imbalanced.
# Visualize the distribution of the target variable in the testing data set using countplot.
plt.title('Countplot: Generator Failure (test data)')
sns.countplot(data=test0, x ='Target')
plt.xlabel("Generator Failure")
plt.ylabel('Count');
# Print the percentage of each value for the target variable in the test set.
print("Percentage of classes in test set:")
print(test0.Target.value_counts(normalize=True))
Percentage of classes in test set:
0   0.945
1   0.055
Name: Target, dtype: float64
The distribution of failure vs. non-failure in testing data is the same as that of the training data set (highly imbalanced toward non-failure). Oversampling or undersampling techniques may help to mitigate the imbalance.
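Resampling is explored later in this notebook; as a point of comparison, many sklearn estimators also accept a `class_weight` argument that reweights the loss instead of changing the data. A minimal sketch on synthetic data with a similar 94.5/5.5 imbalance (names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same 94.5% / 5.5% class imbalance.
X, y = make_classification(n_samples=4000, weights=[0.945], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

recalls = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    recalls[str(cw)] = recall_score(y_te, clf.predict(X_te))

print(recalls)  # 'balanced' typically trades some precision for higher recall
```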
# Separate the x & y variables in the training data set.
X_train0 = train0.drop(['Target'],axis=1)
y_train0 = train0['Target']
# Separate the x & y variables in the test data set.
X_test = test0.drop(['Target'], axis=1)
y_test = test0['Target']
# Split the training data set into a new training set and validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train0, y_train0, test_size=0.3, random_state=1, stratify=y_train0)
# Print the shapes of the new training set, the validation set and the test set to verify that they have been distributed appropriately.
print(X_train.shape, X_val.shape, X_test.shape)
(28000, 40) (12000, 40) (10000, 40)
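The `stratify=y_train0` argument is what keeps the 5.5% failure rate identical across the new training and validation sets. A small sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with a 5.5% positive rate, mirroring the Target distribution above.
y = np.array([0] * 945 + [1] * 55)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(round(y_tr.mean(), 3), round(y_va.mean(), 3))  # both close to 0.055
```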
# Define the imputer to replace null values with the median value for that feature.
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
# Fit the imputer on the training data columns with null values and transform the training data.
X_train[["V1", "V2"]] = imp_median.fit_transform(X_train[["V1", "V2"]])
# Transform the validation and test data using the imputer fit on the training data.
X_val[["V1", "V2"]] = imp_median.transform(X_val[["V1", "V2"]])
X_test[["V1", "V2"]] = imp_median.transform(X_test[["V1", "V2"]])
# Verify that there are no missing values remaining in the training, validation and test sets.
print(X_train.isnull().sum()[X_train.isnull().sum()>0])
print(X_val.isnull().sum()[X_val.isnull().sum()>0])
print(X_test.isnull().sum()[X_test.isnull().sum()>0])
Series([], dtype: int64)
Series([], dtype: int64)
Series([], dtype: int64)
There are no missing values remaining in the training, validation and test sets.
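Note that the imputer statistics are learned from the training data only and then applied unchanged to the validation and test sets, which avoids data leakage. A tiny sketch of that behavior (toy frames, illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames: the fill value must come from the training data only.
train_df = pd.DataFrame({"V1": [1.0, 2.0, 3.0, np.nan]})
val_df = pd.DataFrame({"V1": [np.nan, 10.0]})

imp = SimpleImputer(strategy="median")
train_filled = imp.fit_transform(train_df[["V1"]])  # learns median = 2.0
val_filled = imp.transform(val_df[["V1"]])          # fills with 2.0, not val's own median

print(val_filled.ravel())  # [ 2. 10.]
```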
The nature of predictions made by the classification model will translate as follows:
- A true positive is a real generator failure that the model detects in time.
- A false negative is a real failure that the model misses; these are the costliest errors here, so minimizing them (i.e., maximizing recall) is the priority.
- A false positive is a predicted failure that does not occur, which only triggers an unnecessary check.
Let's define one function that computes several metrics (accuracy, recall, precision, F1) for a fitted classifier, and another that plots its confusion matrix, so the same evaluation code does not have to be repeated for every model.
# Define a function to compute different metrics to check performance of a classification model built using sklearn.
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute accuracy
recall = recall_score(target, pred) # to compute recall
precision = precision_score(target, pred) # to compute precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a DataFrame of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
# Create a user-defined function to display a confusion matrix for classification models.
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Define a scorer to compare parameter combinations.
scorer = metrics.make_scorer(metrics.recall_score)
Model building with original data
# Create an empty list to store all the models.
models = []
# Appending 6 models into the list
models.append(("Decision tree", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient boost", GradientBoostingClassifier(random_state=1)))
# Create an empty list to store all model's CV scores.
results1 = []
# Create an empty list to store the model names.
names = []
# Loop through all the models to get the mean cross validated score.
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Cost:

Decision tree: 0.7413212407655788
Logistic regression: 0.48792872197739035
Random forest: 0.7413212407655788
Bagging: 0.7256583849609334
AdaBoost: 0.6192012092567755
Gradient boost: 0.7230291030635925

Validation Performance:

Decision tree: 0.7454268292682927
Logistic regression: 0.4634146341463415
Random forest: 0.7454268292682927
Bagging: 0.7240853658536586
AdaBoost: 0.6051829268292683
Gradient boost: 0.7195121951219512
The scores for each model are comparable across the training set and validation set. Decision tree and random forest have the highest recall of all the models.
# Create a DataFrame to store the cross validation scores for each model.
cv_scores = pd.DataFrame(results1, columns=[1,2,3,4,5], index=names)
cv_scores.T
| | Decision tree | Logistic regression | Random forest | Bagging | AdaBoost | Gradient boost |
|---|---|---|---|---|---|---|
| 1 | 0.755 | 0.520 | 0.755 | 0.709 | 0.660 | 0.729 |
| 2 | 0.761 | 0.500 | 0.761 | 0.745 | 0.611 | 0.739 |
| 3 | 0.706 | 0.474 | 0.706 | 0.716 | 0.588 | 0.703 |
| 4 | 0.706 | 0.477 | 0.706 | 0.716 | 0.614 | 0.680 |
| 5 | 0.779 | 0.469 | 0.779 | 0.743 | 0.622 | 0.765 |
# Use a for loop to display boxplots for each of the models' cross validation scores.
plt.figure(figsize=(15,3))
for i, model in enumerate(names):
plt.subplot(1, 6, i+1)
plt.boxplot(cv_scores.T[model], whis=1.5)
plt.tight_layout()
plt.title(model)
plt.show()
The boxplots further illustrate that the decision tree and random forest models scored the highest recall out of all the models.
# Use the Synthetic Minority Over Sampling Technique (SMOTE) to increase the importance of the minority class and thereby improve recall.
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Print label counts before and after oversampling to verify that the technique has been correctly executed.
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 1531
Before Oversampling, counts of label 'No': 26469

After Oversampling, counts of label 'Yes': 26469
After Oversampling, counts of label 'No': 26469

After Oversampling, the shape of train_X: (52938, 40)
After Oversampling, the shape of train_y: (52938,)
As expected, the oversampling technique has resulted in equal counts of both classes. The count of the minority class has been inflated to be equal to that of the majority class.
# Create an empty list to store all model's CV scores.
results_over = []
# Loop through all the models to get the mean cross validated score.
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results_over.append(cv_result_over)
print("{}: {}".format(name, cv_result_over.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores_over = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_over))
Cross-Validation Cost:

Decision tree: 0.9689069274906782
Logistic regression: 0.8747214299831179
Random forest: 0.9689069274906782
Bagging: 0.9747252485284147
AdaBoost: 0.8942159673577901
Gradient boost: 0.9163928365232223

Validation Performance:

Decision tree: 0.8125
Logistic regression: 0.8399390243902439
Random forest: 0.8125
Bagging: 0.8414634146341463
AdaBoost: 0.8536585365853658
Gradient boost: 0.8765243902439024
Oversampling the minority class has resulted in higher recall scores although there is some overfitting to the training data. The 2 boosting models have the highest recall scores on the validation set with less overfitting compared to the other models.
# Create a DataFrame to store the cross validation scores for each model built on the oversampled data.
cv_scores_over = pd.DataFrame(results_over, columns=[1,2,3,4,5], index=names)
cv_scores_over.T
| | Decision tree | Logistic regression | Random forest | Bagging | AdaBoost | Gradient boost |
|---|---|---|---|---|---|---|
| 1 | 0.967 | 0.875 | 0.967 | 0.971 | 0.893 | 0.914 |
| 2 | 0.972 | 0.878 | 0.972 | 0.979 | 0.894 | 0.920 |
| 3 | 0.973 | 0.871 | 0.973 | 0.974 | 0.893 | 0.915 |
| 4 | 0.966 | 0.876 | 0.966 | 0.977 | 0.897 | 0.918 |
| 5 | 0.966 | 0.873 | 0.966 | 0.971 | 0.893 | 0.915 |
# Use a for loop to display boxplots for each of the models' cross validation scores on the oversampled data.
plt.figure(figsize=(15,3))
for i, model in enumerate(names):
plt.subplot(1, 6, i+1)
plt.boxplot(cv_scores_over.T[model], whis=1.5)
plt.tight_layout()
plt.title(model)
plt.show()
The boxplots demonstrate that the bagging model, decision tree and random forest have the highest recall scores on the oversampled training data - however they are not the best models overall due to overfitting / lower performance on validation data.
# Use the RandomUnderSampler technique to reduce the importance of the majority class and thereby improve recall.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# Print label counts before and after undersampling to verify that the technique has been correctly executed.
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 1531
Before Under Sampling, counts of label 'No': 26469

After Under Sampling, counts of label 'Yes': 1531
After Under Sampling, counts of label 'No': 1531

After Under Sampling, the shape of train_X: (3062, 40)
After Under Sampling, the shape of train_y: (3062,)
As expected, the undersampling technique has resulted in equal counts of both classes. The count of the majority class has been cut down to be equal to that of the minority class.
# Create an empty list to store all model's CV scores.
results_un = []
# Loop through all the models to get the mean cross validated score.
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_un = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results_un.append(cv_result_un)
print("{}: {}".format(name, cv_result_un.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores_un = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_un))
Cross-Validation Cost:

Decision tree: 0.8510804538970854
Logistic regression: 0.8517063720167763
Random forest: 0.8510804538970854
Bagging: 0.8706691362755743
AdaBoost: 0.8602243937748824
Gradient boost: 0.8850354474037172

Validation Performance:

Decision tree: 0.8384146341463414
Logistic regression: 0.836890243902439
Random forest: 0.8384146341463414
Bagging: 0.8673780487804879
AdaBoost: 0.8628048780487805
Gradient boost: 0.8765243902439024
The undersampling technique has resulted in less overfitting compared to oversampling. Bagging, AdaBoost and gradient boost models have the highest recall performance on the validation data. The recall scores are similar to those of the boosting models trained on the oversampled data.
# Create a DataFrame to store the cross validation scores for each model built on the undersampled data.
cv_scores_un = pd.DataFrame(results_un, columns=[1,2,3,4,5], index=names)
cv_scores_un.T
| | Decision tree | Logistic regression | Random forest | Bagging | AdaBoost | Gradient boost |
|---|---|---|---|---|---|---|
| 1 | 0.886 | 0.843 | 0.886 | 0.873 | 0.873 | 0.886 |
| 2 | 0.847 | 0.889 | 0.847 | 0.876 | 0.857 | 0.896 |
| 3 | 0.804 | 0.807 | 0.804 | 0.850 | 0.840 | 0.869 |
| 4 | 0.846 | 0.856 | 0.846 | 0.869 | 0.859 | 0.866 |
| 5 | 0.873 | 0.863 | 0.873 | 0.886 | 0.873 | 0.908 |
# Use a for loop to display boxplots for each of the models' cross validation scores on the undersampled data.
plt.figure(figsize=(15,3))
for i, model in enumerate(names):
plt.subplot(1, 6, i+1)
plt.boxplot(cv_scores_un.T[model], whis=1.5)
plt.tight_layout()
plt.title(model)
plt.show()
The boxplots demonstrate that the gradient boost model had the highest cross validation recall scores.
Hyperparameter tuning can take a long time to run, so to keep the search time manageable you can use the following grids wherever required.
# For Gradient Boosting:
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# For AdaBoost:
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# For Bagging:
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}
# For Random Forest:
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1), "sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# For Decision Tree:
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# For Logistic Regression:
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}
# For XGBoost:
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
# Define the model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70]}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 1, 'max_features': 0.8} with CV score=1.0:
# Build a bagging model with the best parameters
bag_tuned = BaggingClassifier(
n_estimators=70,
max_samples=1,
max_features=0.8,
random_state=1)
# Fit the model on undersampled training data
bag_tuned.fit(X_train_un, y_train_un)
BaggingClassifier(max_features=0.8, max_samples=1, n_estimators=70,
random_state=1)
# Calculate performance of the tuned model on the training data
bag_tuned_perf_train = model_performance_classification_sklearn(bag_tuned, X_train, y_train)
print("Training performance:")
bag_tuned_perf_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.055 | 1.000 | 0.055 | 0.104 |
# Calculate performance metrics on the validation set
bag_tuned_perf_val = model_performance_classification_sklearn(bag_tuned, X_val, y_val)
print("Validation performance:")
bag_tuned_perf_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.055 | 1.000 | 0.055 | 0.104 |
# Display the confusion matrix for the tuned model on the validation set
confusion_matrix_sklearn(bag_tuned, X_val, y_val)
The tuned bagging model achieved 100% recall by labeling every observation as a failure, which makes it useless in practice. Note that in the grid, `max_samples=1` is an integer, which `BaggingClassifier` interprets as one training sample per base estimator (a fraction would be written `1.0`), so this degenerate fit is unsurprising.
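The int-versus-float reading of `max_samples` is easy to verify in isolation. A minimal demonstration on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=1)

# max_samples=1 (int): each base estimator is trained on a single row.
one_row = BaggingClassifier(max_samples=1, random_state=1).fit(X, y)
# max_samples=1.0 (float): each base estimator sees 100% of the rows.
all_rows = BaggingClassifier(max_samples=1.0, random_state=1).fit(X, y)

print(len(one_row.estimators_samples_[0]))   # 1
print(len(all_rows.estimators_samples_[0]))  # 200
```

Using `1.0` in the parameter grid would avoid this pitfall when tuning.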
# Define the model
Model2 = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid2 = { "n_estimators": [100, 150, 200],
"learning_rate": [0.2, 0.05],
"base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1)]}
# Calling RandomizedSearchCV
randomized_cv2 = RandomizedSearchCV(estimator=Model2, param_distributions=param_grid2, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv2.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv2.best_params_, randomized_cv2.best_score_))
Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8883076792063187:
# Build an AdaBoost model with the best parameters
ada_tuned = AdaBoostClassifier(
n_estimators=200,
learning_rate=0.2,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1))
# Fit the model on undersampled training data
ada_tuned.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=200, random_state=1)
# Calculate performance of the tuned model on the training data
ada_tuned_perf_train = model_performance_classification_sklearn(ada_tuned, X_train, y_train)
print("Training performance:")
ada_tuned_perf_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.948 | 1.000 | 0.514 | 0.679 |
# Calculate performance metrics on the validation set
ada_tuned_perf_val = model_performance_classification_sklearn(ada_tuned, X_val, y_val)
print("Validation performance:")
ada_tuned_perf_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.940 | 0.893 | 0.474 | 0.619 |
# Display the confusion matrix for the tuned model on the validation set
confusion_matrix_sklearn(ada_tuned, X_val, y_val)
The tuned AdaBoost model has achieved very high recall without sacrificing accuracy. This model could be useful for identifying potential failures.
# Define the model
Model3 = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid3 = {"n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7]}
# Calling RandomizedSearchCV
randomized_cv3 = RandomizedSearchCV(estimator=Model3, param_distributions=param_grid3, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv3.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv3.best_params_, randomized_cv3.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.2} with CV score=0.8955036086095675:
# Build a gradient boost model with the best parameters
gboost_tuned = GradientBoostingClassifier(
subsample=0.5,
n_estimators=100,
max_features=0.7,
learning_rate=0.2,
random_state=1)
# Fit the model on undersampled training data
gboost_tuned.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.7, random_state=1,
subsample=0.5)
# Calculate performance of the tuned model on the training data
gboost_tuned_perf_train = model_performance_classification_sklearn(gboost_tuned, X_train, y_train)
print("Training performance:")
gboost_tuned_perf_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.938 | 0.944 | 0.468 | 0.626 |
# Calculate performance metrics on the validation set
gboost_tuned_perf_val = model_performance_classification_sklearn(gboost_tuned, X_val, y_val)
print("Validation performance:")
gboost_tuned_perf_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.934 | 0.887 | 0.450 | 0.597 |
# Display the confusion matrix for the tuned model on the validation set
confusion_matrix_sklearn(gboost_tuned, X_val, y_val)
The tuned gradient boost model is performing similarly to the tuned AdaBoost model. It has achieved fairly high recall without sacrificing accuracy.
# Build a DataFrame to compare the performance measures for the 3 tuned models on the training data.
models_train_comp_df = pd.concat(
[
bag_tuned_perf_train.T,
ada_tuned_perf_train.T,
gboost_tuned_perf_train.T
],
axis=1,
)
models_train_comp_df.columns = [
"Tuned bagging model",
"Tuned AdaBoost model",
"Tuned gradient boost model",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Tuned bagging model | Tuned AdaBoost model | Tuned gradient boost model | |
|---|---|---|---|
| Accuracy | 0.055 | 0.948 | 0.938 |
| Recall | 1.000 | 1.000 | 0.944 |
| Precision | 0.055 | 0.514 | 0.468 |
| F1 | 0.104 | 0.679 | 0.626 |
# Build a DataFrame to compare the performance measures for the 3 tuned models on the validation data.
models_val_comp_df = pd.concat(
[
bag_tuned_perf_val.T,
ada_tuned_perf_val.T,
gboost_tuned_perf_val.T
],
axis=1,
)
models_val_comp_df.columns = [
"Tuned bagging model",
"Tuned AdaBoost model",
"Tuned gradient boost model",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Tuned bagging model | Tuned AdaBoost model | Tuned gradient boost model | |
|---|---|---|---|
| Accuracy | 0.055 | 0.940 | 0.934 |
| Recall | 1.000 | 0.893 | 0.887 |
| Precision | 0.055 | 0.474 | 0.450 |
| F1 | 0.104 | 0.619 | 0.597 |
The tuned AdaBoost model is giving the best performance across all measures (excluding the tuned bagging model's 100% recall which was achieved by predicting all cases as failures, thereby making the model entirely useless).
# Calculate the final model's performance on the test set.
ada_tuned_perf_test = model_performance_classification_sklearn(ada_tuned, X_test, y_test)
print("Test performance:")
ada_tuned_perf_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.939 | 0.879 | 0.470 | 0.612 |
The tuned AdaBoost model performs similarly on the test set and the validation set, which suggests it generalizes well and is suitable for production use.
# Visualize importance of the features in the final model.
feature_names = X_train.columns
importances = ada_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
V18, V36 and V12 are the top 3 most important features according to the final model. V4, V20 and V17 are the least important features.
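For reporting, the same importances can be pulled into a sorted table rather than read off the chart. A small sketch (shown with a stand-in object holding hypothetical importance values; in the notebook the call would be `top_features(ada_tuned, X_train.columns)`):

```python
import numpy as np
import pandas as pd

def top_features(model, feature_names, k=3):
    """Return the k features with the largest importances as a DataFrame."""
    imp = pd.Series(model.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(k).to_frame("Importance")

# Stand-in with hypothetical importances; any fitted tree ensemble works
class _Stub:
    feature_importances_ = np.array([0.1, 0.5, 0.4])

print(top_features(_Stub(), ["V4", "V18", "V36"]))
```

Run against the fitted model above, this should list V18, V36 and V12 first.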
# Separate the variables to re-build the final model with a pipeline.
X_pipe = train0.drop(['Target'], axis=1)
Y_pipe = train0['Target']
# Separate the x & y variables in the test data set.
X_test_pipe = test0.drop(['Target'], axis=1)
y_test_pipe = test0['Target']
# Impute the missing values in the training and testing sets.
imputer = SimpleImputer(strategy="median")
X_pipe2 = imputer.fit_transform(X_pipe)
X_test_pipe2 = imputer.transform(X_test_pipe)  # use the training medians to avoid data leakage
# Apply the undersampling technique on the training data.
rus2 = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_pipe_un, y_pipe_un = rus2.fit_resample(X_pipe2, Y_pipe)
# Create the pipeline for the tuned AdaBoost model.
Model = Pipeline(steps=[("Ada", AdaBoostClassifier(
n_estimators=200,
learning_rate=0.2,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1)))])
# Fit the model on the undersampled training data.
Model.fit(X_pipe_un, y_pipe_un)
Pipeline(steps=[('Ada',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=200,
random_state=1))])
# Check the model's performance on the test data.
Model_test = model_performance_classification_sklearn(Model, X_test_pipe2, y_test_pipe)
Model_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.953 | 0.885 | 0.541 | 0.671 |
The model is performing well on the test data, with very good recall of 88.5% and very high accuracy at 95.3%.
The tuned AdaBoost model should be implemented to help predict potential failures before they occur, thereby saving money by avoiding replacement costs wherever possible. V18, V36 and V12 are the most important features according to the model - the company should focus its attention on these features for prediction and maintenance purposes.
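For deployment, the fitted pipeline can be serialized and reloaded so the same imputation-plus-model object scores new sensor readings. A minimal sketch with joblib, using a stand-in pipeline and an illustrative file path (in the notebook, the fitted `Model` would be dumped instead):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline with the same impute-then-classify shape as `Model`
pipe = Pipeline(steps=[("impute", SimpleImputer(strategy="median")),
                       ("clf", DecisionTreeClassifier(random_state=1))])
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
pipe.fit(X, y)

# Persist the fitted pipeline and reload it as a production scorer would
path = os.path.join(tempfile.mkdtemp(), "ada_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)

print(loaded.predict([[5.5]]))  # identical predictions to the original pipeline
```

Bundling the imputer inside the pipeline means raw readings with missing values can be scored directly, without re-running the preprocessing steps by hand.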
Marks: 60
The stock market has consistently proven to be a good place to invest and save for the future. There are compelling reasons to invest in stocks: it helps fight inflation, builds wealth, and offers some tax benefits. Steady returns on investments over a long period can grow far more than seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximize earnings under any market condition. A diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is easy to get lost in a sea of financial metrics when determining the worth of a stock, and repeating that analysis for a multitude of stocks to identify the right picks can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics and those that exhibit minimal correlation. This helps investors analyze stocks across different market segments and protects against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
# Import libraries for data manipulation.
import pandas as pd
import numpy as np
# Import libraries for data visualization.
import matplotlib.pyplot as plt
import seaborn as sns
# Import library for scaling data.
from sklearn.preprocessing import StandardScaler
# Import library to build K-means clustering model.
from sklearn.cluster import KMeans
# Import library to compute distances.
from scipy.spatial.distance import cdist
# Import library to compute silhouette scores.
from sklearn.metrics import silhouette_score
# Import library to visualize silhouette scores.
from yellowbrick.cluster import SilhouetteVisualizer
# Import linkage methods & cophenetic correlation for hierarchical clustering.
from scipy.cluster.hierarchy import linkage, cophenet
# Import library to compute pairwise distance.
from scipy.spatial.distance import pdist
# Import dendrogram to visualize hierarchical clustering.
from scipy.cluster.hierarchy import dendrogram
# Import library to build a hierarchical clustering model.
from sklearn.cluster import AgglomerativeClustering
# Set the precision of floating numbers to 3 decimal points.
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the data set.
df = pd.read_csv('/content/drive/My Drive/Data science/Data sets/stock_data.csv')
# Create a copy of the data set to avoid altering the original data.
data = df.copy()
# Use print function and .shape attribute to display the number of rows & columns in the data set.
print('There are',data.shape[0],'rows and',data.shape[1],'columns in the data set.')
There are 340 rows and 15 columns in the data set.
# Display a sample of 10 rows from the data set to get a general idea of the information & make sure it is loaded properly.
data.sample(10)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111 | EQR | Equity Residential | Real Estate | REITs | 81.590 | 8.038 | 1.056 | 8 | 47 | 2196000 | 870120000 | 2.370 | 367139240.500 | 34.426 | -1.269 |
| 253 | PNR | Pentair Ltd. | Industrials | Industrial Machinery | 49.530 | -3.034 | 1.876 | 2 | 8 | 15900000 | -76400000 | -0.420 | 181904761.900 | 14.579 | -6.575 |
| 316 | VTR | Ventas Inc | Real Estate | REITs | 56.430 | 0.213 | 1.445 | 4 | 47 | -1803000 | 419222000 | 1.260 | 332715873.000 | 44.786 | -4.041 |
| 239 | PBCT | People's United Financial | Financials | Thrifts & Mortgage Finance | 16.150 | 3.129 | 1.133 | 5 | 99 | -298400000 | 260100000 | 0.860 | 302441860.500 | 18.779 | -0.427 |
| 81 | CTXS | Citrix Systems | Information Technology | Internet Software & Services | 75.650 | 9.021 | 1.969 | 16 | 52 | 108369000 | 319361000 | 2.010 | 158886069.700 | 37.637 | -1.765 |
| 334 | XYL | Xylem Inc. | Industrials | Industrial Conglomerates | 36.500 | 11.010 | 1.166 | 16 | 83 | 17000000 | 340000000 | 1.880 | 180851063.800 | 19.415 | 4.130 |
| 56 | CCI | Crown Castle International Corp. | Real Estate | REITs | 86.450 | 9.569 | 0.960 | 21 | 36 | 3190000 | 1520992000 | 4.440 | 342565765.800 | 19.471 | -10.667 |
| 331 | XOM | Exxon Mobil Corp. | Energy | Integrated Oil & Gas | 77.950 | 3.657 | 1.370 | 9 | 7 | -911000000 | 16150000000 | 3.850 | 4194805195.000 | 20.247 | -2.706 |
| 322 | WM | Waste Management Inc. | Industrials | Environmental Services | 53.370 | 7.061 | 0.940 | 14 | 2 | -1268000000 | 753000000 | 1.660 | 453614457.800 | 32.151 | -1.415 |
| 203 | MDLZ | Mondelez International | Consumer Staples | Packaged Foods & Meats | 44.840 | 6.080 | 1.322 | 26 | 17 | 239000000 | 7267000000 | 4.490 | 1618485523.000 | 9.987 | -12.810 |
# Use .info() method to check the non-null counts and data types of each column in the data set.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
None of the columns have null values.
Columns of object type: Ticker Symbol, Security, GICS Sector, GICS Sub Industry
Columns of numeric type: Current Price, Price Change, Volatility, ROE, Cash Ratio, Net Cash Flow, Net Income, Earnings Per Share, Estimated Shares Outstanding, P/E Ratio, P/B Ratio
# Check the statistical summary for the data set.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.000 | 80.862 | 98.055 | 4.500 | 38.555 | 59.705 | 92.880 | 1274.950 |
| Price Change | 340.000 | 4.078 | 12.006 | -47.130 | -0.939 | 4.820 | 10.695 | 55.052 |
| Volatility | 340.000 | 1.526 | 0.592 | 0.733 | 1.135 | 1.386 | 1.696 | 4.580 |
| ROE | 340.000 | 39.597 | 96.548 | 1.000 | 9.750 | 15.000 | 27.000 | 917.000 |
| Cash Ratio | 340.000 | 70.024 | 90.421 | 0.000 | 18.000 | 47.000 | 99.000 | 958.000 |
| Net Cash Flow | 340.000 | 55537620.588 | 1946365312.176 | -11208000000.000 | -193906500.000 | 2098000.000 | 169810750.000 | 20764000000.000 |
| Net Income | 340.000 | 1494384602.941 | 3940150279.328 | -23528000000.000 | 352301250.000 | 707336000.000 | 1899000000.000 | 24442000000.000 |
| Earnings Per Share | 340.000 | 2.777 | 6.588 | -61.200 | 1.558 | 2.895 | 4.620 | 50.090 |
| Estimated Shares Outstanding | 340.000 | 577028337.754 | 845849595.418 | 27672156.860 | 158848216.100 | 309675137.800 | 573117457.325 | 6159292035.000 |
| P/E Ratio | 340.000 | 32.613 | 44.349 | 2.935 | 15.045 | 20.820 | 31.765 | 528.039 |
| P/B Ratio | 340.000 | -1.718 | 13.967 | -76.119 | -4.352 | -1.067 | 3.917 | 129.065 |
The distributions of the variables will be described in further detail in the exploratory data analysis section below.
# Check for duplicated values in the data set.
print('There are',data.duplicated().sum(),'duplicated values in the data set.')
There are 0 duplicated values in the data set.
# Check for missing (null) values in the data set.
data.isnull().sum()
Ticker Symbol                   0
Security                        0
GICS Sector                     0
GICS Sub Industry               0
Current Price                   0
Price Change                    0
Volatility                      0
ROE                             0
Cash Ratio                      0
Net Cash Flow                   0
Net Income                      0
Earnings Per Share              0
Estimated Shares Outstanding    0
P/E Ratio                       0
P/B Ratio                       0
dtype: int64
As shown above with the .info() method, there are no null values in the data set.
# Create a user-defined function to display histogram & boxplot together for numeric variables.
def histogram_boxplot(data, feature, figsize=(18, 5), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (18,5))
kde: whether to show the density curve (default True)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.3, 0.7)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="skyblue"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Visualize histogram & boxplot for stock prices.
sns.set_palette("Set3")
histogram_boxplot(data=data, feature='Current Price')
# Display the statistical summary for current stock price.
data.describe().T.loc['Current Price']
count 340.000 mean 80.862 std 98.055 min 4.500 25% 38.555 50% 59.705 75% 92.880 max 1274.950 Name: Current Price, dtype: float64
As suggested by the statistical summary, current stock price is skewed right with many outliers on the higher end. Notably, the maximum outlier stands alone at \$1274.95 with the next highest outlier falling just below \$700.
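The "many outliers" reading comes from the boxplot's 1.5×IQR whiskers, and the same rule can be counted directly. A small sketch (illustrated on a few made-up prices; in the notebook the call would be `iqr_outliers(data['Current Price'])`):

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values falling outside the 1.5*IQR whiskers (boxplot rule)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Illustrative series, not the project data
prices = pd.Series([4.5, 38.6, 59.7, 92.9, 120.0, 1274.95])
print(iqr_outliers(prices))  # only the extreme 1274.95 is flagged
```

The same helper can be reused for the other right-skewed variables below (ROE, Cash Ratio, P/E Ratio, and so on).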
# Use a barplot to visualize the average price increase across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Price Change',
order=data.groupby(['GICS Sector'])['Price Change'].mean().reset_index().sort_values('Price Change')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average price change for each sector.
data.groupby(['GICS Sector'])['Price Change'].mean().sort_values(ascending=False)
GICS Sector Health Care 9.586 Consumer Staples 8.685 Information Technology 7.217 Telecommunications Services 6.957 Real Estate 6.206 Consumer Discretionary 5.846 Materials 5.590 Financials 3.865 Industrials 2.833 Utilities 0.804 Energy -10.228 Name: Price Change, dtype: float64
The health care sector has seen the highest average price increase at about 9.6% over 13 weeks. The next highest price increase is in the consumer staples sector with an 8.7% increase, followed by information technology with a 7.2% increase. The energy sector saw a significant drop in prices with a change of about -10%.
# Use heatmap to visualize correlation between numeric variables.
plt.figure(figsize=(15, 8))
sns.heatmap(data.corr(), annot=True);
# Use a for loop with logical indexing to create a DataFrame displaying the attributes with correlation above 0.3.
corr = pd.DataFrame()
for row in data.corr().index:
corr = pd.concat([corr, data.corr()[row][abs(data.corr()[row])>0.3].drop(row)], axis=1)
corr.dropna(axis=1,how='all')
| Current Price | Price Change | Volatility | ROE | Net Income | Earnings Per Share | Estimated Shares Outstanding | |
|---|---|---|---|---|---|---|---|
| Earnings Per Share | 0.480 | NaN | -0.379 | -0.405 | 0.558 | NaN | NaN |
| Volatility | NaN | -0.408 | NaN | NaN | -0.383 | -0.379 | NaN |
| Price Change | NaN | NaN | -0.408 | NaN | NaN | NaN | NaN |
| Net Income | NaN | NaN | -0.383 | NaN | NaN | 0.558 | 0.589 |
| Estimated Shares Outstanding | NaN | NaN | NaN | NaN | 0.589 | NaN | NaN |
| Current Price | NaN | NaN | NaN | NaN | NaN | 0.480 | NaN |
| ROE | NaN | NaN | NaN | NaN | NaN | -0.405 | NaN |
The highest positive correlation is between Net Income and Estimated Shares Outstanding at 0.589. The next highest correlation is between Net Income and Earnings Per Share at 0.558. Earnings Per Share is also moderately correlated with Current Price, with a correlation value of 0.48. There is a moderate negative correlation between Price Change and Volatility at -0.408, as well as between ROE and Earnings per Share at -0.405. There are also negative correlations between Volatility and Net Income at -0.383, and Volatility and Earnings per Share at -0.379.
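As an alternative to the loop above, the correlation matrix can be stacked so that each strongly correlated pair is listed exactly once. A sketch on a small made-up frame (in the notebook, `data.corr()` would be the input):

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.3) -> pd.Series:
    """List each off-diagonal pair with |correlation| above the threshold once."""
    # Keep only the upper triangle so every pair appears a single time
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs.abs() > threshold].sort_values(key=abs, ascending=False)

# Illustrative data, not the project data
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 3, 1]})
print(strong_pairs(df.corr()))
```

This yields a single ranked list of correlated pairs instead of a sparse matrix with NaNs.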
# Use a barplot to visualize the average cash ratio across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Cash Ratio',
order=data.groupby(['GICS Sector'])['Cash Ratio'].mean().reset_index().sort_values('Cash Ratio')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average cash ratio for each sector.
data.groupby(['GICS Sector'])['Cash Ratio'].mean().sort_values(ascending=False)
GICS Sector Information Technology 149.818 Telecommunications Services 117.000 Health Care 103.775 Financials 98.592 Consumer Staples 70.947 Energy 51.133 Real Estate 50.111 Consumer Discretionary 49.575 Materials 41.700 Industrials 36.189 Utilities 13.625 Name: Cash Ratio, dtype: float64
The information technology sector has the highest average cash ratio at about 150, followed by telecommunications services (117) and health care (104). The utilities sector has the lowest average cash ratio at about 14, followed by industrials (36) and materials (42).
# Use a barplot to visualize the average P/E ratio across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='P/E Ratio',
order=data.groupby(['GICS Sector'])['P/E Ratio'].mean().reset_index().sort_values('P/E Ratio')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average P/E ratio for each sector.
data.groupby(['GICS Sector'])['P/E Ratio'].mean().sort_values(ascending=False)
GICS Sector Energy 72.898 Information Technology 43.783 Real Estate 43.066 Health Care 41.135 Consumer Discretionary 35.212 Consumer Staples 25.521 Materials 24.585 Utilities 18.719 Industrials 18.259 Financials 16.023 Telecommunications Services 12.223 Name: P/E Ratio, dtype: float64
The energy sector has the highest P/E ratio with a value of 72.9, followed by the information technology sector at 43.8 then real estate with a P/E ratio around 43. The sectors with the lowest P/E ratios are telecommunications services (12.2), financials (16) and industrials (18.3).
# Visualize the distribution of GICS Sector using countplot.
plt.figure(figsize=(18, 5))
plt.title('Countplot: GICS Sector')
sns.countplot(data=data, x ='GICS Sector', order=data['GICS Sector'].value_counts().reset_index()['index'].tolist())
plt.ylabel('Count')
plt.xticks(rotation=20);
# Display the count of companies in each sector using .value_counts().
data['GICS Sector'].value_counts()
Industrials 53 Financials 49 Health Care 40 Consumer Discretionary 40 Information Technology 33 Energy 30 Real Estate 27 Utilities 24 Materials 20 Consumer Staples 19 Telecommunications Services 5 Name: GICS Sector, dtype: int64
Industrials is the most represented sector in the data set with 53 companies, followed by financials with 49, and health care and consumer discretionary with 40 each. The least represented sector is telecommunications services with only 5 companies.
# Visualize the distribution of GICS Sub Industries using countplot.
plt.figure(figsize=(20, 5))
plt.title('Countplot: GICS Sub Industry')
sns.countplot(data=data, x ='GICS Sub Industry', order=data['GICS Sub Industry'].value_counts().reset_index()['index'].tolist())
plt.ylabel('Count')
plt.xticks(rotation=90);
# Display the most common sub-industries using value_counts().
data['GICS Sub Industry'].value_counts().head(8)
Oil & Gas Exploration & Production 16 REITs 14 Industrial Conglomerates 14 Electric Utilities 12 Internet Software & Services 12 Health Care Equipment 11 MultiUtilities 11 Banks 10 Name: GICS Sub Industry, dtype: int64
The most represented sub-industry is oil & gas exploration & production with 16 companies, followed by REITs (real estate investment trusts) and industrial conglomerates, each with 14 companies.
# Visualize histogram & boxplot for price change.
sns.set_palette("Set3")
histogram_boxplot(data=data, feature='Price Change')
# Display the statistical summary for price change.
data.describe().T.loc['Price Change']
count 340.000 mean 4.078 std 12.006 min -47.130 25% -0.939 50% 4.820 75% 10.695 max 55.052 Name: Price Change, dtype: float64
Price change appears to have a relatively normal distribution with the mean close to the median (4% vs. 5%) and outliers on both ends. There are a few more outliers on the low end compared to the high end.
# Visualize histogram & boxplot for volatility.
histogram_boxplot(data=data, feature='Volatility')
# Display the statistical summary for volatility.
data.describe().T.loc['Volatility']
count 340.000 mean 1.526 std 0.592 min 0.733 25% 1.135 50% 1.386 75% 1.696 max 4.580 Name: Volatility, dtype: float64
Volatility is heavily right-skewed with several outliers on the high end.
# Visualize histogram & boxplot for ROE.
histogram_boxplot(data=data, feature='ROE')
# Display the statistical summary for ROE.
data.describe().T.loc['ROE']
count 340.000 mean 39.597 std 96.548 min 1.000 25% 9.750 50% 15.000 75% 27.000 max 917.000 Name: ROE, dtype: float64
ROE is severely right-skewed with several extremely high outliers. The mean is significantly higher than the median and even falls above the 75th percentile.
# Visualize histogram & boxplot for Cash Ratio.
histogram_boxplot(data=data, feature='Cash Ratio')
# Display the statistical summary for cash ratio.
data.describe().T.loc['Cash Ratio']
count 340.000 mean 70.024 std 90.421 min 0.000 25% 18.000 50% 47.000 75% 99.000 max 958.000 Name: Cash Ratio, dtype: float64
# Display the most common value of cash ratio.
data['Cash Ratio'].mode()
0    99
dtype: int64
Cash ratio is right-skewed with one outlier far above the rest. The most common value is 99, which is also the 75th percentile.
# Visualize histogram & boxplot for Net Cash Flow.
histogram_boxplot(data=data, feature='Net Cash Flow')
# Display the statistical summary for net cash flow.
data.describe().T.loc['Net Cash Flow']
count 340.000 mean 55537620.588 std 1946365312.176 min -11208000000.000 25% -193906500.000 50% 2098000.000 75% 169810750.000 max 20764000000.000 Name: Net Cash Flow, dtype: float64
Net cash flow has a significant number of outliers and an extreme range from a minimum of -\$11,208,000,000 to a maximum of \$20,764,000,000. It appears to follow a somewhat normal distribution but with a very high peak near the middle.
# Visualize histogram & boxplot for Net Income.
histogram_boxplot(data=data, feature='Net Income')
# Display the statistical summary for net income.
data.describe().T.loc['Net Income']
count 340.000 mean 1494384602.941 std 3940150279.328 min -23528000000.000 25% 352301250.000 50% 707336000.000 75% 1899000000.000 max 24442000000.000 Name: Net Income, dtype: float64
Net income has a high peak near the median. It is overall right skewed due to a larger number of outliers on the high end making the mean higher than the median. There are also several outliers on the low end and a wide range between the minimum and maximum values.
# Visualize histogram & boxplot for Earnings Per Share.
histogram_boxplot(data=data, feature='Earnings Per Share')
# Display the statistical summary for earnings per share.
data.describe().T.loc['Earnings Per Share']
count 340.000 mean 2.777 std 6.588 min -61.200 25% 1.558 50% 2.895 75% 4.620 max 50.090 Name: Earnings Per Share, dtype: float64
The mean and median earnings per share are close together and the distribution appears roughly symmetrical. There are outliers on both the high and low ends.
# Visualize histogram & boxplot for Estimated Shares Outstanding.
histogram_boxplot(data=data, feature='Estimated Shares Outstanding')
# Display the statistical summary for estimated shares outstanding.
data.describe().T.loc['Estimated Shares Outstanding']
count 340.000 mean 577028337.754 std 845849595.418 min 27672156.860 25% 158848216.100 50% 309675137.800 75% 573117457.325 max 6159292035.000 Name: Estimated Shares Outstanding, dtype: float64
Estimated shares outstanding has a heavily right-skewed distribution with many outliers on the high end and none on the low end. The mean is above the median and even slightly higher than the 75th percentile due to the effect of the outliers. The most common values are near zero.
# Visualize histogram & boxplot for P/E Ratio.
histogram_boxplot(data=data, feature='P/E Ratio')
# Display the statistical summary for P/E ratio.
data.describe().T.loc['P/E Ratio']
count 340.000 mean 32.613 std 44.349 min 2.935 25% 15.045 50% 20.820 75% 31.765 max 528.039 Name: P/E Ratio, dtype: float64
The distribution for P/E ratio is somewhat multimodal with the highest peak near the median (around 21) and another smaller peak just below 100. It is right-skewed with a sizeable number of outliers on the high end bringing the mean above the median and just above the 75th percentile.
# Visualize histogram & boxplot for P/B Ratio.
histogram_boxplot(data=data, feature='P/B Ratio')
# Display the statistical summary for P/B ratio.
data.describe().T.loc['P/B Ratio']
count 340.000 mean -1.718 std 13.967 min -76.119 25% -4.352 50% -1.067 75% 3.917 max 129.065 Name: P/B Ratio, dtype: float64
The distribution of P/B ratio is roughly symmetric with the mean close to the median (-1.7 vs. -1) and outliers on both the high and low ends.
# Use a barplot to visualize the average current stock price across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Current Price',
order=data.groupby(['GICS Sector'])['Current Price'].mean().reset_index().sort_values('Current Price')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average stock price for each sector.
data.groupby(['GICS Sector'])['Current Price'].mean().sort_values(ascending=False)
GICS Sector Health Care 132.048 Consumer Discretionary 128.095 Real Estate 90.977 Materials 76.552 Industrials 74.412 Consumer Staples 71.973 Information Technology 63.548 Financials 58.659 Utilities 52.969 Energy 46.042 Telecommunications Services 32.964 Name: Current Price, dtype: float64
Health care and consumer discretionary sectors have the highest average stock price around \$130. The telecommunications services sector has the lowest stock price at \$33.
# Use a barplot to visualize the average volatility across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Volatility',
order=data.groupby(['GICS Sector'])['Volatility'].mean().reset_index().sort_values('Volatility')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average volatility for each sector.
data.groupby(['GICS Sector'])['Volatility'].mean().sort_values(ascending=False)
GICS Sector Energy 2.569 Materials 1.817 Information Technology 1.660 Consumer Discretionary 1.595 Health Care 1.541 Industrials 1.417 Telecommunications Services 1.342 Financials 1.267 Real Estate 1.206 Consumer Staples 1.153 Utilities 1.118 Name: Volatility, dtype: float64
The energy sector has the highest average volatility at about 2.6, while the utilities sector has the lowest at about 1.1.
# Use a barplot to visualize the average ROE across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='ROE',
order=data.groupby(['GICS Sector'])['ROE'].mean().reset_index().sort_values('ROE')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average ROE for each sector.
data.groupby(['GICS Sector'])['ROE'].mean().sort_values(ascending=False)
GICS Sector
Energy                         93.200
Consumer Staples               89.421
Industrials                    50.151
Consumer Discretionary         44.900
Materials                      33.000
Telecommunications Services    32.600
Health Care                    27.775
Financials                     26.286
Information Technology         21.788
Real Estate                    12.444
Utilities                       9.875
Name: ROE, dtype: float64
The energy and consumer staples sectors have the highest average ROE, around 90. Utilities and real estate have the lowest average ROE, around 10.
# Use a barplot to visualize the average net cash flow across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Net Cash Flow',
order=data.groupby(['GICS Sector'])['Net Cash Flow'].mean().reset_index().sort_values('Net Cash Flow')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average net cash flow for each sector.
data.groupby(['GICS Sector'])['Net Cash Flow'].mean().sort_values(ascending=False)
GICS Sector
Information Technology          483099121.212
Health Care                     262687800.000
Consumer Staples                258627210.526
Financials                      254356306.122
Utilities                       176462291.667
Consumer Discretionary           84213175.000
Real Estate                       3546703.704
Industrials                    -160103150.943
Materials                      -291236850.000
Energy                         -308318233.333
Telecommunications Services   -1816800000.000
Name: Net Cash Flow, dtype: float64
The information technology sector has the highest average net cash flow, around \$483 million. Telecommunications services has a substantially lower net cash flow than the other sectors, at roughly -\$1.8 billion. Other sectors with negative net cash flow include energy, materials, and industrials.
# Use a barplot to visualize the average net income across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Net Income',
order=data.groupby(['GICS Sector'])['Net Income'].mean().reset_index().sort_values('Net Income')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average net income for each sector.
data.groupby(['GICS Sector'])['Net Income'].mean().sort_values(ascending=False)
GICS Sector
Telecommunications Services   7067800000.000
Financials                    3202677979.592
Consumer Staples              2518833052.632
Health Care                   2018515350.000
Industrials                   1722373113.208
Information Technology        1701587272.727
Consumer Discretionary        1373450075.000
Utilities                     1107145541.667
Real Estate                    567775740.741
Materials                      278516500.000
Energy                       -2087527466.667
Name: Net Income, dtype: float64
The telecommunications services sector has the highest net income around \$7 billion, more than twice as much as the next highest net income of \$3.2 billion for the financials sector. Energy is the only sector with a negative average net income.
# Use a barplot to visualize the average earnings per share across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Earnings Per Share',
order=data.groupby(['GICS Sector'])['Earnings Per Share'].mean().reset_index().sort_values('Earnings Per Share')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average earnings per share for each sector.
data.groupby(['GICS Sector'])['Earnings Per Share'].mean().sort_values(ascending=False)
GICS Sector
Health Care                    4.541
Consumer Discretionary         4.526
Industrials                    4.457
Financials                     4.220
Telecommunications Services    3.550
Consumer Staples               3.224
Materials                      3.129
Utilities                      2.753
Real Estate                    2.340
Information Technology         2.266
Energy                        -6.908
Name: Earnings Per Share, dtype: float64
The three sectors with the highest earnings per share are health care, consumer discretionary, and industrials, each around \$4.50. Energy is the only sector with negative earnings per share, at -\$6.91.
# Use a barplot to visualize the average estimated shares outstanding across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='Estimated Shares Outstanding',
order=data.groupby(['GICS Sector'])['Estimated Shares Outstanding'].mean().reset_index().sort_values('Estimated Shares Outstanding')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average estimated shares outstanding for each sector.
data.groupby(['GICS Sector'])['Estimated Shares Outstanding'].mean().sort_values(ascending=False)
GICS Sector
Telecommunications Services   2259575293.520
Consumer Staples               913685176.711
Information Technology         828278930.792
Health Care                    684595513.887
Financials                     683877334.853
Energy                         663528783.071
Consumer Discretionary         399651258.324
Utilities                      380919306.562
Industrials                    354716946.844
Real Estate                    344454019.890
Materials                      308524571.474
Name: Estimated Shares Outstanding, dtype: float64
The telecommunications services sector has by far the highest average number of estimated shares outstanding, around 2.3 billion, more than twice that of the runner-up, consumer staples. The sectors with the fewest shares outstanding include materials, real estate, industrials, utilities, and consumer discretionary, each averaging fewer than 400 million.
# Use a barplot to visualize the average P/B ratio across different economic sectors.
plt.figure(figsize=(18, 5))
sns.barplot(data=data, x='GICS Sector', y='P/B Ratio',
order=data.groupby(['GICS Sector'])['P/B Ratio'].mean().reset_index().sort_values('P/B Ratio')['GICS Sector'].tolist())
plt.xticks(rotation=20);
# Use .groupby() function to display the average P/B ratio for each sector.
data.groupby(['GICS Sector'])['P/B Ratio'].mean().sort_values(ascending=False)
GICS Sector
Information Technology          6.377
Energy                          2.540
Materials                       0.723
Health Care                     0.069
Industrials                    -0.979
Real Estate                    -3.003
Utilities                      -3.087
Financials                     -4.271
Consumer Staples               -4.554
Consumer Discretionary         -8.254
Telecommunications Services   -11.010
Name: P/B Ratio, dtype: float64
The information technology sector has the highest P/B ratio around 6.4, while telecommunications services has the lowest P/B ratio at -11.
In the data overview section above it was observed that there are no missing or duplicate values in the data set.
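Those two checks amount to one line each. A minimal sketch, using a small hypothetical frame in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for the stock DataFrame.
df = pd.DataFrame({"Security": ["A", "B", "C"],
                   "Current Price": [10.0, 20.0, 30.0]})

print(df.isnull().sum().sum())  # total missing values across all columns
print(df.duplicated().sum())    # count of fully duplicated rows
```

Both counts being zero is what "no missing or duplicate values" means here.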
# Use boxplots to visualize outliers.
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Outliers are present in every numeric column. However, they should not be treated, as they represent legitimate values rather than data errors.
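For reference, the boxplot whiskers above (`whis=1.5`) flag points outside 1.5 times the IQR beyond the quartiles. A sketch of that rule on a hypothetical column:

```python
import pandas as pd

# Hypothetical column; 50.0 sits far outside the rest.
col = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
# Points beyond the whisker bounds count as outliers.
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [50.0]
```

The same filter applied per numeric column would count the outliers visible in each boxplot.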
# Scale the data using z-scores so that features of higher magnitude are not disproportionately weighted in the clustering models.
from sklearn.preprocessing import StandardScaler

num_col = data.select_dtypes(include=np.number).columns.tolist()
scaler = StandardScaler()
subset = data[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
# Create a DataFrame from the scaled data.
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df.sample(5)
| | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 264 | 4.719 | 1.077 | 0.468 | -0.234 | 0.653 | 0.054 | -0.218 | 0.516 | -0.561 | 1.250 | 1.587 |
| 72 | -0.457 | -0.178 | -0.826 | -0.266 | -0.643 | -0.009 | -0.243 | -0.133 | -0.349 | -0.308 | 0.128 |
| 195 | 0.168 | 0.285 | -0.728 | 0.243 | 0.664 | 0.285 | 0.588 | 0.089 | 0.659 | -0.082 | 0.371 |
| 201 | 0.381 | 1.323 | -1.342 | 0.253 | 2.104 | 2.857 | 0.771 | 0.311 | 0.429 | -0.183 | 0.634 |
| 122 | 0.075 | 0.821 | -0.575 | -0.214 | -0.344 | -0.014 | -0.279 | -0.182 | -0.387 | 0.524 | -0.892 |
# Use a for loop to plot the error for each possible value of k (number of clusters).
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

clusters = range(2, 10)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20);
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20);
Number of Clusters: 2 	Average Distortion: 2.382318498894466
Number of Clusters: 3 	Average Distortion: 2.2692367155390745
Number of Clusters: 4 	Average Distortion: 2.1745559827866363
Number of Clusters: 5 	Average Distortion: 2.1132726257719026
Number of Clusters: 6 	Average Distortion: 2.0619418051649574
Number of Clusters: 7 	Average Distortion: 2.0237293529159226
Number of Clusters: 8 	Average Distortion: 1.993994405876358
Number of Clusters: 9 	Average Distortion: 1.9270632295762185
The best "elbow" is not obvious as there are many slight bends. There appear to be slight elbows at k=4 and k=6 where the slope becomes noticeably more shallow.
# Visualize silhouette scores to help determine the best value of k.
from sklearn.metrics import silhouette_score

sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    # centers = clusterer.cluster_centers_
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score);
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.45434371948348606
For n_clusters = 5, silhouette score is 0.3536294114640298
For n_clusters = 6, silhouette score is 0.42153070977646057
For n_clusters = 7, silhouette score is 0.4103391826247439
For n_clusters = 8, silhouette score is 0.4089925983528137
For n_clusters = 9, silhouette score is 0.1384848848850714
k=4 has a higher silhouette score than k=6 (0.454 vs. 0.422), which indicates that it may be the better choice.
# Use a silhouette visualizer for more insight into the best value of k.
from yellowbrick.cluster import SilhouetteVisualizer

visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show();
The presence of negative silhouette scores indicates that the model is failing to assign observations into the appropriate clusters. One of the clusters holds vastly more observations than the other 3.
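The negative bars the visualizer draws can also be counted directly with `silhouette_samples`, the per-observation quantity behind the plot. A minimal sketch on synthetic blobs standing in for the scaled stock data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for subset_scaled_df (hypothetical data).
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

labels = KMeans(n_clusters=4, random_state=1, n_init=10).fit_predict(X)
# One silhouette coefficient per observation; values below zero mark points
# that sit closer to a neighboring cluster than to their own.
sil = silhouette_samples(X, labels)
print("negative silhouette values:", (sil < 0).sum())
```

Counting `(sil < 0)` on the real scaled data would quantify how many stocks the model places in a questionable cluster.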
# Check the silhouette scores in each cluster for the model where k=6.
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show();
Again there is a troubling number of negative silhouette coefficients. Increasing the number of clusters does not appear to add value to the model, so we will proceed cautiously with k=4.
# Fit the final K-means model with k=4.
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=4, random_state=0)
# Make copies of the original and scaled DataFrames to which cluster labels will be added.
og_data = data.copy()
scaled_data = subset_scaled_df.copy()
# Add K-means cluster labels to the original and scaled DataFrames.
og_data["K_means_segments"] = kmeans.labels_
scaled_data["K_means_segments"] = kmeans.labels_
# Use groupby function to display cluster profiles.
cluster_profile = og_data.groupby("K_means_segments").mean(numeric_only=True)
cluster_profile["count_in_each_segment"] = (og_data.groupby("K_means_segments")["Security"].count().values)
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| K_means_segments | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1554926560.000000 | 1572611680.000000 | 6.045200 | 578316318.948800 | 74.960824 | 14.402452 | 25 |
| 1 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -159428481.481481 | -3887457740.740741 | -9.473704 | 480398572.845926 | 90.619220 | 1.342067 | 27 |
| 2 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1072272727.272727 | 14833090909.090910 | 4.154545 | 4298826628.727273 | 14.803577 | -4.552119 | 11 |
| 3 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -14046223.826715 | 1482212389.891697 | 3.621029 | 438533835.667184 | 23.843656 | -3.358948 | 277 |
# Display boxplots to compare features across clusters.
plt.figure(figsize=(20, 15))
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=og_data, x="K_means_segments", y=variable)
plt.tight_layout(pad=2.0)
# Use a barplot to visualize and compare the averages of each numeric feature across different clusters.
scaled_data.groupby("K_means_segments").mean().plot.bar(figsize=(15, 6));
Cluster 0 features: the highest average current price, price change, cash ratio, net cash flow, earnings per share, and P/B ratio; contains 25 companies.
Cluster 1 features: the lowest (negative) average price change, the highest volatility, ROE, and P/E ratio, and negative average net income and earnings per share; contains 27 companies.
Cluster 2 features: the highest average net income and estimated shares outstanding, and the lowest average P/E ratio; contains 11 companies.
Cluster 3 features: moderate values across all features; the largest cluster, with 277 companies.
# Use groupby function to see how GICS sectors are represented across the clusters.
og_data.groupby(['K_means_segments', 'GICS Sector'])['Security'].count()
K_means_segments GICS Sector
0 Consumer Discretionary 6
Consumer Staples 1
Energy 1
Financials 1
Health Care 9
Information Technology 5
Real Estate 1
Telecommunications Services 1
1 Energy 22
Industrials 1
Information Technology 3
Materials 1
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 2
Information Technology 1
Telecommunications Services 2
3 Consumer Discretionary 33
Consumer Staples 17
Energy 6
Financials 45
Health Care 29
Industrials 52
Information Technology 24
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
Name: Security, dtype: int64
Cluster 3 has a large majority of observations overall. The most represented sector in cluster 0 is the health care sector, although a higher number of observations from the health care sector are present in cluster 3. Cluster 1 has the highest number of observations from the energy sector. Cluster 2 is the smallest cluster overall with only a few observations from a small number of sectors.
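The same cluster-by-sector breakdown can be viewed in a wide layout with `pd.crosstab`. The tiny frame below uses hypothetical labels in place of `og_data`:

```python
import pandas as pd

# Hypothetical stand-in for og_data with only the two relevant columns.
df = pd.DataFrame({
    "K_means_segments": [0, 0, 1, 3, 3, 3],
    "GICS Sector": ["Health Care", "Energy", "Energy",
                    "Financials", "Financials", "Utilities"],
})

# Rows are clusters, columns are sectors, cells are security counts.
ct = pd.crosstab(df["K_means_segments"], df["GICS Sector"])
print(ct)
```

The wide layout makes it easier to scan which sectors dominate each cluster than the long groupby output above.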
hc_data = subset_scaled_df.copy()
# Compute cophenetic correlation for various distance measures.
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_data, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_data))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
# Print the combination of distance metric and linkage method with the highest cophenetic correlation.
print("Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
# Explore different linkage methods with Euclidean distance.
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(hc_data, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_data))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
# Print the combination of distance metric and linkage method with the highest cophenetic correlation.
print("Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
The cophenetic correlation is maximized with Euclidean distance and average linkage.
# Visualize dendrograms for various linkage methods.
from scipy.cluster.hierarchy import dendrogram

# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# Create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Enumerate through the list of linkage methods above, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_data, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hc_data))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
# Build the hierarchical clustering model.
# Note: in scikit-learn >= 1.2 the affinity argument is named metric.
from sklearn.cluster import AgglomerativeClustering

HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="average")
HCmodel.fit(hc_data)
AgglomerativeClustering(linkage='average', n_clusters=4)
# Make a copy of the original (un-scaled) data to add hierarchical clustering labels.
og_hc_data = data.copy()
hc_data['HC_Clusters'] = HCmodel.labels_
og_hc_data['HC_Clusters'] = HCmodel.labels_
# Use groupby function to display cluster profiles.
cluster_profile_hc = og_hc_data.groupby("HC_Clusters").mean(numeric_only=True)
cluster_profile_hc["count_in_each_segment"] = (og_hc_data.groupby("HC_Clusters")["Security"].count().values)
cluster_profile_hc.style.highlight_max(color="lightgreen", axis=0)
| HC_Clusters | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.573266 | 4.148438 | 1.515708 | 35.184524 | 67.154762 | 67104693.452381 | 1607391086.309524 | 2.905640 | 572317821.413095 | 32.325679 | -1.762402 | 336 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
# Display boxplots to compare features across clusters.
plt.figure(figsize=(20, 15))
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=og_hc_data, x="HC_Clusters", y=variable)
plt.tight_layout(pad=2.0)
# Use a barplot to visualize and compare the averages of each numeric feature across different clusters.
hc_data.groupby("HC_Clusters").mean().plot.bar(figsize=(15, 6));
# Use groupby function to see how GICS sectors are represented across the clusters.
og_hc_data.groupby(['HC_Clusters', 'GICS Sector'])['Security'].count()
HC_Clusters GICS Sector
0 Consumer Discretionary 39
Consumer Staples 19
Energy 28
Financials 49
Health Care 40
Industrials 53
Information Technology 32
Materials 20
Real Estate 27
Telecommunications Services 5
Utilities 24
1 Consumer Discretionary 1
2 Energy 2
3 Information Technology 1
Name: Security, dtype: int64
Cluster 0 contains the vast majority of observations; compared to the other clusters, all of its feature values lie near the middle on average. Cluster 1 contains a single observation, from the consumer discretionary sector, with a very high current price and high earnings per share. Cluster 2 contains only two observations, both from the energy sector, distinguished by relatively high volatility, high ROE, and low net income and earnings per share. Cluster 3 contains a single observation, from the information technology sector, with a very high cash ratio and a relatively high number of estimated shares outstanding compared to the other clusters.
There was no noticeable difference in execution time between the K-means and hierarchical clustering models. However, on a larger data set hierarchical clustering would be expected to require significantly more computation time.
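That comparison can be made concrete with a quick timing sketch. The array shape below (roughly 340 rows by 11 features, matching the scale of this data set) and the use of `time.perf_counter` are illustrative assumptions, not part of the original analysis:

```python
import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Random stand-in with roughly the same shape as the scaled stock data.
X = np.random.RandomState(0).normal(size=(340, 11))

t0 = time.perf_counter()
KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)
t_km = time.perf_counter() - t0

t0 = time.perf_counter()
AgglomerativeClustering(n_clusters=4, linkage="average").fit(X)
t_hc = time.perf_counter() - t0

print(f"k-means: {t_km:.4f}s  hierarchical: {t_hc:.4f}s")
```

At this size both fits finish almost instantly; the gap would widen on larger data because agglomerative clustering scales worse than K-means with the number of observations.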
The K-means technique seems to have resulted in more distinct clusters. For the most part each cluster has a unique average value for every feature.
Both models ended up with one cluster vastly larger than the others. The hierarchical model took this to the extreme, with three of its four clusters containing only one or two observations. The K-means model placed 277 observations in its largest cluster and 27, 25, and 11 in the three smaller ones. Because of this difference, the K-means model seems more robust and preferable to the hierarchical clustering model.
Both algorithms resulted in 4 final clusters.
The main similarity between the cluster profiles is that each clustering technique resulted in one cluster having "medium" values across all features. Besides that, the features defining each cluster are quite different.